CN109447103B - Big data classification method, device and equipment based on hard clustering algorithm

Info

Publication number
CN109447103B
CN109447103B (application CN201811044932.0A)
Authority
CN
China
Prior art keywords: centers, data information, clustering, primary, cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811044932.0A
Other languages
Chinese (zh)
Other versions
CN109447103A (en
Inventor
金戈
徐亮
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority claimed from CN201811044932.0A
Publication of CN109447103A (application)
Application granted
Publication of CN109447103B (grant)
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a big data classification method, device and equipment based on a hard clustering algorithm. The method comprises the following steps: acquiring data information and dividing it into N pieces of sample data; performing primary hard cluster analysis on each piece of sample data to determine N×K1 primary cluster centers; performing secondary hard cluster analysis on the N×K1 primary cluster centers to determine K2 secondary cluster centers; dividing the data information into K2 classification items according to the K2 secondary cluster centers, and storing each classification item and its corresponding data information in a database. With this scheme, the secondary cluster centers obtained are more accurate, classification according to them is more effective, and each resulting classification item has distinct characteristics, so that a user can distinguish the classification items without confusion.

Description

Big data classification method, device and equipment based on hard clustering algorithm
Technical Field
The present application relates to the field of data analysis technologies, and in particular, to a method, an apparatus, and a device for classifying big data based on a hard clustering algorithm.
Background
Many companies are growing rapidly, and a company with a large number of employees needs to perform people analysis on its staff so as to classify them.
At present, a clustering algorithm is generally used to classify collected crowd data: people are grouped into categories by their characteristics, and the crowd is analysed according to the classification result. For example, this can help market analysts distinguish different consumer groups in a consumer database and summarise the consumption patterns or habits of each type of consumer.
The most commonly used clustering algorithm is K-means. However, the existing K-means method selects cluster centers randomly; if the cluster centers are chosen poorly, the clustering effect is poor and the resulting classification is not accurate enough.
Disclosure of Invention
In view of the above, the application provides a big data classification method, device and equipment based on a hard clustering algorithm, mainly aiming to solve the technical problem that the existing K-means method selects cluster centers randomly, so that poorly chosen centers lead to a poor clustering effect and an insufficiently accurate classification result.
According to a first aspect of the present application, there is provided a big data classification method based on a hard clustering algorithm, the steps of the method comprising:
acquiring data information, and dividing the data information into N pieces of sample data, where N ≥ 1;
performing primary hard cluster analysis on each piece of sample data to determine N×K1 primary cluster centers, where K1 is the number of primary cluster centers determined in the primary hard cluster analysis and K1 ≥ 1;
performing secondary hard cluster analysis on the N×K1 primary cluster centers to determine K2 secondary cluster centers, where K2 is the number of secondary cluster centers determined in the secondary hard cluster analysis and K2 ≥ 1;
dividing the data information into K2 classification items according to the K2 secondary cluster centers, and storing each classification item and its corresponding data information in a database.
According to a second aspect of the present application, there is provided a big data classification device based on a hard clustering algorithm, the device comprising:
an acquisition unit, configured to acquire data information and divide it into N pieces of sample data, where N ≥ 1;
a cluster analysis unit, configured to perform primary hard cluster analysis on each piece of sample data to determine N×K1 primary cluster centers, where K1 is the number of primary cluster centers determined in the primary hard cluster analysis and K1 ≥ 1;
the cluster analysis unit, further configured to perform secondary hard cluster analysis on the N×K1 primary cluster centers to determine K2 secondary cluster centers, where K2 is the number of secondary cluster centers determined in the secondary hard cluster analysis and K2 ≥ 1;
and a classification unit, configured to divide the data information into K2 classification items according to the K2 secondary cluster centers and store each classification item and its corresponding data information in a database.
According to a third aspect of the present application, there is provided a computer device comprising a memory storing a computer program and a processor which, when executing the computer program, implements the steps of the big data classification method based on a hard clustering algorithm of the first aspect.
According to a fourth aspect of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the big data classification method based on a hard clustering algorithm of the first aspect.
By means of the above technical scheme, the big data classification method, device and equipment based on a hard clustering algorithm divide a large amount of data information into N pieces of sample data, perform primary cluster analysis on each piece of sample data with the hard clustering algorithm to obtain N×K1 primary cluster centers, and then perform secondary cluster analysis on those N×K1 primary cluster centers to obtain K2 secondary cluster centers. The secondary cluster centers obtained in this way are more accurate, classification according to them is more effective, and each resulting classification item has distinct characteristics, so that a user can distinguish the classification items without confusion.
The foregoing is only an overview of the technical solution of the present application. In order that the technical means of the application may be more clearly understood and implemented, and that the above and other objects, features and advantages of the application may become more readily apparent, specific embodiments are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of one embodiment of a hard clustering algorithm-based big data classification method of the present application;
FIG. 2 is a block diagram illustrating an embodiment of a hard clustering algorithm based big data classification device according to the present application;
fig. 3 is a schematic structural diagram of a computer device according to the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the application provides a big data classification method based on a hard clustering algorithm. By performing hard clustering on the data information two or more times, a more accurate cluster center is obtained, so that classification according to that cluster center is more accurate.
As shown in fig. 1, the embodiment of the application provides a big data classification method based on a hard clustering algorithm, which comprises the following steps:
and 101, acquiring data information, and dividing the data information into N pieces of sample data, wherein N is more than or equal to 1.
In this step, data information of company staff, company clients or other groups of people is acquired, the data information including: gender, age, hobbies, height, weight, income, etc. And carrying out cluster analysis according to the data information, and dividing corresponding categories.
Step 102: performing primary hard clustering (K-means) analysis on each piece of sample data to determine N×K1 primary cluster centers, where K1 is the number of primary cluster centers determined after the primary hard cluster analysis of each piece of sample data, and K1 ≥ 1.
In this step, a corresponding numerical value is assigned to each piece of data information.
For example, gender: the first digit is 1; the second digit is 1 for male and 2 for female.
Age: the first digit is 2, and the age is placed directly after it.
Hobbies: the first digit is 3, and a distinct second digit is set for each hobby.
Height: the first digit is 4, and the height value (in cm) is placed after it.
Weight: the first digit is 5, and the weight value (in kg) is placed after it.
Income: the first digit is 6, and the income value is placed after it.
This data information is taken as the sample set and divided into N pieces of sample data.
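The numbering scheme above can be sketched in code. This is only an illustration; the dictionary and function names (`FIELD_TAGS`, `encode_field`) are our own, not from the patent:

```python
# Each record field maps to a number whose leading digit tags the field
# and whose remaining digits carry the value (hobby code, cm, kg, etc.).
FIELD_TAGS = {"gender": 1, "age": 2, "hobby": 3, "height": 4, "weight": 5, "income": 6}

def encode_field(field: str, value: int) -> int:
    """Prefix the numeric value with the field's tag digit."""
    tag = FIELD_TAGS[field]
    return int(f"{tag}{value}")

# Gender uses 1 for male, 2 for female as the second digit.
print(encode_field("gender", 1))   # 11 (male)
print(encode_field("age", 35))     # 235
print(encode_field("height", 172)) # 4172
```

Each encoded number then serves as the vertical-axis value of that record in the coordinate systems described below.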
The entry time of each piece of data information is then acquired, and a coordinate system O1 is established with the entry time of the data information on the horizontal axis and its numerical value on the vertical axis. In coordinate system O1, primary cluster analysis is performed on each piece of sample data with the hard clustering (K-means) algorithm, yielding N×K1 primary cluster centers.
Each primary cluster center corresponds to one piece of data information.
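The per-sample primary clustering of step 102 can be sketched as follows. This is an assumption-laden illustration: it runs plain K-means over (entry time, value) pairs and snaps each center to the nearest actual data point so that, as required here, every primary cluster center corresponds to one piece of data information; the function name and the toy data are our own:

```python
import math
import random

def kmeans_per_sample(points, k1, iters=10, seed=0):
    """Plain K-means over (entry_time, value) points; returned centers are
    snapped to actual data points so each center is a real record."""
    rng = random.Random(seed)
    centers = rng.sample(points, k1)
    for _ in range(iters):
        # Assign every point to its nearest current center.
        clusters = [[] for _ in range(k1)]
        for p in points:
            nearest = min(range(k1), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the member closest to the cluster mean.
        for i, c in enumerate(clusters):
            if c:
                mean = (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                centers[i] = min(c, key=lambda p: math.dist(p, mean))
    return centers

# N=4 toy samples of (time, value) pairs -> N*K1 primary centers.
samples = [[(t, (t * 7) % 50) for t in range(s, 40, 4)] for s in range(4)]
primary_centers = [c for s in samples for c in kmeans_per_sample(s, 2)]
print(len(primary_centers))  # 4 samples x K1=2 = 8 primary centers
```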
Step 103: performing secondary hard cluster analysis on the N×K1 primary cluster centers to determine K2 secondary cluster centers, where K2 is the number of secondary cluster centers determined in the secondary hard cluster analysis, and K2 ≥ 1.
A coordinate system O2 is re-established with the entry time of the data information corresponding to the N×K1 primary cluster centers on the horizontal axis and the numerical value on the vertical axis. Secondary cluster analysis is performed on the N×K1 primary cluster centers with the K-means algorithm to obtain K2 secondary cluster centers. As before, each secondary cluster center corresponds to one piece of data information, obtained by secondary cluster analysis from the data information corresponding to the primary cluster centers.
Step 104: dividing the data information into K2 classification items according to the K2 secondary cluster centers, and storing each classification item and its corresponding data information in a database.
In this step, the K2 secondary cluster centers obtained through two rounds of K-means cluster analysis are mapped onto the coordinates of the data information in coordinate system O1 established in step 102. Taking the secondary cluster centers as centers, the data information in O1 is divided into K2 regions, each corresponding to a different classification item. The specific classification items are derived from the final clustering result; for example: high business capability, low business capability, good at communication, open character, etc.
The data information within each classification item is then arranged in order of entry time, associated with each person's personal information, and the resulting list is stored in the database. The user can then retrieve from the database the group of people belonging to whichever classification item is needed.
For example, if the leader of an insurance company wants to find the insurance agents in the high-business-capability class and reward them, the high-business-capability class is looked up directly in the database, the personal information of all insurance agents in that class is retrieved, and it is displayed to the leader. Company leaders or others can likewise summarise the characteristics of company staff or clients according to the classified categories.
Through the above technical scheme, cluster analysis with two rounds of the hard clustering algorithm yields a comparatively accurate cluster center, classification according to that cluster center is more effective, and the distinctions between the classification items are more obvious. The user can thus analyse the characteristics of the crowd from the classification result, or retrieve the needed group of people from it.
Step 104 specifically includes:
step 1041, determining whether the number K2 of secondary clustering centers is greater than or equal to a set threshold, if yes, entering step 1042, and if no, entering step 1043.
And step 1042, performing hard cluster analysis on the K2 secondary cluster centers again until the number of the determined final cluster centers is smaller than a set threshold.
Step 1043, dividing the data information into K2 classification items according to the K2 secondary clustering centers, and storing each classification item and the corresponding data information in the database.
In the above technical solution, the user may choose a threshold according to the actual situation (for example, 100) and compare the number of secondary cluster centers obtained above against it. If that number is greater than or equal to the threshold, correspondingly many classification items would be obtained, the distinctions between them would not be obvious, and the classification effect would be poor; moreover, too many classification items increase the search time when the user retrieves a needed classification item from the database. Therefore the K2 secondary cluster centers are clustered again with the K-means algorithm, following the same process as step 103, until the number of final cluster centers determined is smaller than the set threshold.
With this technical scheme, whether cluster analysis needs to be performed again is judged from the number of secondary cluster centers obtained after the secondary cluster analysis. This ensures that the number of classification items does not reach the set threshold, that there are obvious distinguishing features between classification items, and that no confusion arises, improving the classification effect.
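The threshold check of steps 1041-1042 amounts to a loop. A minimal sketch (names are our own; the toy `reduce_fn` merely stands in for a real K-means pass over the centers):

```python
def cluster_until_below(centers, threshold, reduce_fn):
    """Repeat hard clustering on the current centers until their count
    drops below the user-set threshold."""
    while len(centers) >= threshold:
        centers = reduce_fn(centers)
    return centers

# Toy reduce_fn: halve the center list (stand-in for a K-means pass).
final = cluster_until_below(list(range(1000)), 100, lambda cs: cs[::2])
print(len(final))  # 1000 -> 500 -> 250 -> 125 -> 63, first count under 100
```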
Step 102 specifically includes:
in step 1021, K1 first initial cluster centers are determined for each sample data.
In the step, after the first initial clustering centers are determined, the position of each first initial clustering center is found out in a coordinate system O1, and marked, so that the positioning is convenient.
Step 1022: calculating the distance between the data information in each piece of sample data and the K1 first initial cluster centers.
In this step, the coordinates of the K1 first initial cluster centers and of the other data information in a piece of sample data are determined in coordinate system O1, and the distance from each piece of other data information P1(x1, y1) to each first initial cluster center P2(x2, y2) is calculated as d = √((x2 − x1)² + (y2 − y1)²).
Step 1023: assigning each piece of data information to the first category corresponding to its nearest first initial cluster center, so that each piece of sample data yields K1 first categories and the data information corresponding to each first category.
In this step, if the distance between a piece of data information and a certain first initial cluster center is the shortest, that data information is most similar to the first initial cluster center and likely belongs to the same type, so it is assigned to the first category corresponding to that first initial cluster center.
Step 1024: determining a first center point for each first category of each piece of sample data, and selecting the data information closest to that first center point as the primary cluster center, so that N pieces of sample data yield N×K1 primary cluster centers.
In this step, the initial centers corresponding to each first category are not accurate enough, and the cluster centers need to be re-selected. Therefore the two pieces of data information in each first category that are farthest apart are found, and the midpoint of the line connecting them is taken as the first center point of the corresponding first category. Since this first center point may be a virtual point that does not correspond to any data information and so cannot serve as a primary cluster center, the data information closest to the first center point is selected as the primary cluster center.
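The center-point selection above can be sketched as follows (an illustration with our own names; it assumes the cluster holds at least two points):

```python
import math
from itertools import combinations

def representative(cluster):
    """Find the two points farthest apart, take the midpoint of their
    connecting line, and return the actual point nearest that (possibly
    virtual) midpoint, as the patent describes."""
    a, b = max(combinations(cluster, 2), key=lambda pair: math.dist(*pair))
    mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    return min(cluster, key=lambda p: math.dist(p, mid))

# Farthest pair is (0,0)-(10,0); midpoint (5,0) is virtual; nearest real
# point is (4,0), which becomes the cluster's primary cluster center.
print(representative([(0, 0), (1, 0), (4, 0), (10, 0)]))  # (4, 0)
```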
In step 1021, the first initial cluster centers may be determined for each piece of sample data in either of two ways.
First, if the user is unfamiliar with the data information and lacks experience with it, the following steps are taken:
Step 10211: randomly selecting one piece of data information C1 from each piece of sample data, calculating the distance D(x) from each piece of data information x in that sample data to C1, and computing the probability that each piece of data information becomes a first initial cluster center according to the formula P(x) = D(x)² / Σ_x' D(x')².
Step 10212: taking the K1 pieces of data information whose probability value is greater than or equal to a predetermined probability value as the first initial cluster centers.
Second, if the user knows the data information relatively well, the number of first initial cluster centers can be estimated directly, and the following steps are taken:
Step 10211': setting the number K1 as the number of first initial cluster centers.
For example, when the user triggers a cluster-center-quantity setting button, a window for entering the quantity pops up; the user enters whatever number is considered reasonable and clicks the confirm key.
Step 10212': randomly selecting K1 pieces of data information from each piece of sample data as the first initial cluster centers.
With this technical scheme, the user may choose either of the two ways of determining the first initial cluster centers according to the actual situation, which is convenient to use.
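The two initialization modes can be sketched as below. This is an illustrative reading of the text: mode 1 weights each point by the squared-distance probability P(x) = D(x)² / Σ D(x')² (in the spirit of K-means++ seeding) and keeps the K1 most probable points, mode 2 simply samples K1 points at random; all names are our own:

```python
import math
import random

def init_by_probability(points, k1, seed=0):
    """Mode 1 (user unfamiliar with the data): pick a random point C1,
    weight every point by its squared distance to C1, and take the K1
    points with the largest probability as initial centers."""
    rng = random.Random(seed)
    c1 = rng.choice(points)
    total = sum(math.dist(p, c1) ** 2 for p in points) or 1.0
    prob = {p: math.dist(p, c1) ** 2 / total for p in points}
    return sorted(points, key=lambda p: prob[p], reverse=True)[:k1]

def init_by_count(points, k1, seed=0):
    """Mode 2 (user can estimate K1): sample K1 points at random."""
    return random.Random(seed).sample(points, k1)

pts = [(float(i), float(i % 5)) for i in range(20)]
print(len(init_by_probability(pts, 3)), len(init_by_count(pts, 3)))  # 3 3
```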
Step 103 specifically includes:
step 1031, selecting K2 from the n×k1 primary cluster centers as the second primary cluster center.
In this step, the selection process of the second initial cluster center is similar to the selection process of the first initial cluster center, specifically:
randomly selecting one data information C2 from N x K1 primary clustering centers by using a coordinate system O2, calculating the distance D (x) between each primary clustering center and the C2, and according to a formulaAnd calculating the probability value of each primary clustering center as a second primary clustering center. Will probability valuesAnd K2 primary clustering centers which are larger than or equal to a preset probability value are used as second primary clustering centers.
Or,
the number K2 is set as the number of second initial cluster centers, and K2 of the N×K1 primary cluster centers are randomly selected in coordinate system O2 as the second initial cluster centers.
Step 1032: calculating the distance between each primary cluster center and the K2 second initial cluster centers.
In this step, the coordinates of the K2 second initial cluster centers are marked in coordinate system O2, and the distances are then calculated.
Step 1033: assigning each primary cluster center to the second category corresponding to its nearest second initial cluster center, so that the K2 second initial cluster centers correspond to K2 second categories.
In this step, if the distance between a primary cluster center and a certain second initial cluster center is the shortest, that primary cluster center is most similar to the second initial cluster center and likely belongs to the same type, so it is assigned to the second category corresponding to that second initial cluster center.
Step 1034: determining a second center point for each second category, and selecting the primary cluster center closest to that second center point as the secondary cluster center, obtaining K2 secondary cluster centers.
In this step, the cluster centers corresponding to the second categories are not accurate enough, and the cluster centers need to be re-selected. Therefore the two primary cluster centers in each second category that are farthest apart are found, and the midpoint of the line connecting them is taken as the second center point of the corresponding second category. Since this second center point may be a virtual point and cannot serve as a secondary cluster center, the primary cluster center closest to the second center point is selected as the secondary cluster center.
With this technical scheme, the secondary cluster centers are obtained by a second round of K-means cluster analysis performed on the primary cluster centers obtained from the first round, so the secondary cluster centers are more accurate and classification according to them is more effective.
The step 101 specifically includes:
Step 1011: obtaining the total number of pieces of data information and dividing the data information evenly into N pieces of sample data, each holding a predetermined number of pieces, where the last piece of sample data holds at most the predetermined number.
Or,
Step 1011': obtaining the maximum value A and the minimum value B of the data information, dividing the interval from B to A into N equal subranges to obtain N value ranges, and dividing the data information into N pieces of sample data according to those N value ranges.
With this technical scheme, because the amount of data information is huge and performing cluster analysis on all of it at once could crash the system, the data information can be divided evenly by count, or evenly by value, into N pieces of sample data, and cluster analysis can then be performed on each piece of sample data, effectively improving the cluster analysis.
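The two splitting modes of steps 1011 and 1011' might look like this (function names are our own):

```python
def split_by_count(records, per_sample):
    """Divide records into samples of a predetermined size; the last
    sample may hold fewer records."""
    return [records[i:i + per_sample] for i in range(0, len(records), per_sample)]

def split_by_range(values, n):
    """Divide [min, max] into N equal subranges and bucket each value
    into its subrange."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    buckets = [[] for _ in range(n)]
    for v in values:
        i = min(int((v - lo) / width), n - 1) if width else 0
        buckets[i].append(v)
    return buckets

print([len(s) for s in split_by_count(list(range(10)), 4)])       # [4, 4, 2]
print([len(b) for b in split_by_range([1, 2, 3, 10, 11, 20], 2)]) # [4, 2]
```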
Step 104 specifically includes:
step 1041, determining a corresponding classification item for each secondary cluster center.
Step 1042, calculating the distance between the data information and each secondary clustering center, and distributing the data information to the classification item corresponding to the secondary clustering center with the shortest distance.
Step 1043, storing the obtained K2 classification items and the corresponding data information in a database.
In the above technical solution, the user may name the classification item corresponding to each secondary cluster center according to practical experience, for example: higher-experience staff, medium-experience staff, lower-experience staff, etc.
Then the K2 secondary cluster centers obtained in coordinate system O2 are transferred into coordinate system O1, and each secondary cluster center is marked in O1. The distance between each piece of data information and each secondary cluster center is calculated, and the degree of correlation between them is determined from the distance: the shorter the distance, the higher the correlation. Each piece of data information is therefore assigned to the classification item corresponding to its nearest secondary cluster center, completing the classification task.
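The classification of steps 1041-1043 can be sketched as below (an illustration; the label names and function name are our own):

```python
import math

def classify(records, secondary_centers, labels):
    """Assign each record to the classification item of its nearest
    secondary cluster center; `labels` are the user-chosen names."""
    items = {lab: [] for lab in labels}
    for r in records:
        i = min(range(len(secondary_centers)),
                key=lambda j: math.dist(r, secondary_centers[j]))
        items[labels[i]].append(r)
    return items

centers = [(0.0, 0.0), (10.0, 10.0)]
out = classify([(1, 1), (9, 9), (0, 2)], centers,
               ["low experience", "high experience"])
print({k: len(v) for k, v in out.items()})
# {'low experience': 2, 'high experience': 1}
```

The resulting `items` mapping is what would be persisted to the database, one entry per classification item.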
With this technical scheme, a large amount of data information can be divided into N pieces of sample data; a hard clustering algorithm then performs primary cluster analysis on each piece of sample data to obtain N×K1 primary cluster centers, and secondary cluster analysis on those N×K1 primary cluster centers to obtain K2 secondary cluster centers. The secondary cluster centers obtained are more accurate, classification according to them is more effective, and each resulting classification item has distinct characteristics, so that a user can distinguish the classification items without confusion.
In another embodiment of the present application, a big data classification method based on a hard clustering algorithm includes the steps of:
1. obtaining a sample
For an insurance company, personal-information data (i.e., data information) of the insurance agents needs to be collected, including gender, age, hobbies, height, weight, education level, number of clients served, income, etc.; this personal data is collected as the sample.
2. Primary clustering
The sample is divided evenly into N equal parts, and K-means clustering is performed on each of the N parts, obtaining N×K1 cluster centers.
The specific clustering process is as follows:
(1) For each sample part, determine K1 initial cluster centers. The value K1 may be preset by the user or determined as follows:
randomly select one piece of data information C1 from each sample part, calculate the distance D(x) from each other piece of data information to C1, and compute the probability of each piece of data information being an initial cluster center as P(x) = D(x)² / Σ_x' D(x')²; select the K1 pieces of data information whose probability exceeds a preset probability as the initial cluster centers.
(2) Calculate the distance between each piece of data in each sample part and the initial cluster centers, assign each piece of data to the category of its nearest initial cluster center, recalculate the corresponding cluster centers with the K-means algorithm from the data information of each category, and repeat until the cluster centers no longer change.
(3) Compute primary cluster centers for all N sample parts in the manner of steps (1) and (2), obtaining N×K1 primary cluster centers.
3. Secondary clustering
Using the N×K1 primary cluster centers as the sample, perform secondary clustering with the K-means algorithm.
(1) Set the number of secondary cluster centers to K2. The value K2 may be preset, or determined as follows:
Randomly select one primary cluster center C2 from the N×K1 primary cluster centers, calculate the distance D(x) from every other primary cluster center x to C2, and compute the probability of each primary cluster center being chosen as an initial center for the secondary clustering, P(x) = D(x)² / Σ_{x∈X} D(x)². The K2 primary cluster centers whose probability exceeds a preset probability value are taken as the initial secondary cluster centers.
(2) Calculate the distance between each piece of data information and the secondary cluster centers, and assign each piece of data information to the category of the nearest secondary cluster center, yielding the data information for K2 categories. Then recompute each category's secondary cluster center with K-means, and repeat these steps until the secondary cluster centers no longer change, giving the final K2 secondary cluster centers.
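The secondary stage treats the primary centers themselves as the sample to cluster. A minimal sketch, under our own assumptions: the six hypothetical primary centers are invented for illustration, and initialization is plain random rather than probability-based.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means (nearest-center assignment + mean update)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = ((points[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        new = np.array([points[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

# Hypothetical N*K1 = 6 primary cluster centers from the primary stage.
primary = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
secondary = kmeans(primary, k=2)   # K2 = 2 secondary cluster centers
```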
In addition, if the number K2 of secondary cluster centers obtained is still large, a third or further round of cluster analysis can be performed, and the cluster centers produced by the final round are used to classify the data information.
In this multi-round clustering process, each round re-clusters the results of the previous round, so the resulting cluster centers are more accurate than those obtained from a single round of clustering.
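The repeat-until-small-enough loop described above can be sketched as follows. The halving schedule (`len // 2` per round) is our assumption for illustration; the text only requires re-clustering until the center count falls below the threshold.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means (nearest-center assignment + mean update)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = ((points[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        new = np.array([points[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

def cluster_until_small(centers, threshold):
    """Re-cluster round after round, each round using the previous
    round's centers as the sample, until fewer than `threshold`
    centers remain."""
    while len(centers) >= threshold:
        centers = kmeans(centers, k=max(1, len(centers) // 2))
    return centers

start = np.random.default_rng(0).normal(0, 1, (8, 2))   # 8 current centers
final = cluster_until_small(start, threshold=3)          # rounds: 8 -> 4 -> 2
```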
4. Classifying samples
Use the obtained K2 secondary cluster centers to classify all of the collected data information into K2 categories.
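Step 4 is a nearest-center assignment. A minimal sketch with hypothetical values: the two-feature `[age, income]` encoding, the centers, and the records are all invented for illustration.

```python
import numpy as np

# Hypothetical K2 = 2 secondary cluster centers, e.g. over [age, income].
centers = np.array([[30.0, 8.0], [50.0, 20.0]])
# Hypothetical agent records to classify.
records = np.array([[28.0, 7.0], [52.0, 22.0], [31.0, 9.0], [47.0, 18.0]])

# Assign each record to the category of its nearest secondary center.
dist = ((records[:, None] - centers[None]) ** 2).sum(-1)
labels = dist.argmin(axis=1)   # category index for each record -> [0, 1, 0, 1]
```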
Managers at the insurance company can then analyze the characteristics of the insurance agents in each category from that category's data information, and plan work better suited to the characteristics of each category of agents.
In summary, by performing two or more rounds of cluster analysis on the collected data information, this scheme obtains accurate cluster centers, so the categories derived from those centers are well separated and the clustering effect is good: the group in each category exhibits its own distinct characteristics. In addition, known classification results can be examined to see how group differences are reflected in the features, thereby clarifying how agent features influence their group classification.
Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides a big data classification device based on a hard clustering algorithm, as shown in fig. 2, where the device includes: an acquisition unit 21, a cluster analysis unit 22, and a classification unit 23.
An acquisition unit 21 for acquiring data information, dividing the data information into N pieces of sample data, wherein N is greater than or equal to 1;
the cluster analysis unit 22 is configured to perform primary hard cluster analysis on each sample data, and determine n×k1 primary cluster centers, where K1 is the number of primary cluster centers determined after the primary hard cluster analysis on each sample data, and K1 is greater than or equal to 1;
the cluster analysis unit 22 is further configured to perform secondary hard cluster analysis on the n×k1 primary cluster centers, and determine K2 secondary cluster centers, where K2 is the number of secondary cluster centers determined in the secondary hard cluster analysis, and K2 is greater than or equal to 1;
the classification unit 23 is configured to divide the data information into K2 classification items according to the K2 secondary clustering centers, and store each classification item and the corresponding data information in the database.
In a specific embodiment, the classification unit 23 specifically includes:
the judging module is used for judging whether the number K2 of the secondary clustering centers is larger than or equal to a set threshold value; if the judgment result is yes, performing hard cluster analysis on the K2 secondary cluster centers again until the number of the determined final cluster centers is smaller than a set threshold value; if the judgment result is negative, dividing the data information into K2 classification items according to the K2 secondary clustering centers, and storing each classification item and the corresponding data information in a database.
In a specific embodiment, the cluster analysis unit 22 specifically includes:
the center determining module is used for determining K1 first initial clustering centers for each sample data;
the distance calculation module is used for calculating the distance between the data information in each sample data and K1 first initial clustering centers;
the distribution module is used for distributing the data information to the first categories corresponding to the first initial clustering centers with the shortest distance, and each sample data obtains K1 first categories and the data information corresponding to each first category;
and the selecting module is used for determining a first center point for each first category of each sample data, and selecting the data information closest to the first center point as the primary cluster center, so that the N samples correspond to N×K1 primary cluster centers.
In a specific embodiment, the center determining module specifically includes:
the probability calculation module is used for randomly selecting one piece of data information C1 from each sample data, calculating the distance D(x) between each piece of data information in the sample containing C1 and C1, and, according to the formula P(x) = D(x)² / Σ_{x∈X} D(x)², calculating the probability value of each piece of data information being a first initial cluster center; the K1 pieces of data information whose probability value is greater than or equal to a preset probability value serve as the first initial cluster centers;
or,
the random module is used for setting the number K1 as the number of the first initial clustering centers; and randomly selecting K1 data information from each sample data as a first initial clustering center.
In a specific embodiment, the center determining module is further configured to select K2 centers from the N×K1 primary cluster centers as second initial cluster centers;
the distance calculation module is further used for calculating the distance between each primary cluster center and the K2 second initial cluster centers;
the distribution module is further used for distributing the primary clustering centers to second categories corresponding to second initial clustering centers with shortest distances, wherein K2 second initial clustering centers correspond to K2 second categories;
the selecting module is further used for determining a second center point for each second category, selecting a primary clustering center with the shortest distance from the second center point as a secondary clustering center, and obtaining K2 secondary clustering centers.
In a specific embodiment, the obtaining unit 21 is further configured to obtain the total number of pieces of data information and divide the data information evenly into N samples of a predetermined size each, where the last sample contains at most the predetermined number of pieces;
or,
and is further used for obtaining the maximum value A and the minimum value B of the data information, dividing the interval from B to A into N equal value ranges, and dividing the data information into N samples according to the N value ranges.
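The range-based splitting described for the obtaining unit can be sketched as follows. A minimal illustration over a 1-D key; the specific values and the use of `np.linspace`/`np.searchsorted` to form the N equal ranges are our assumptions.

```python
import numpy as np

values = np.array([3.0, 7.0, 1.0, 9.0, 4.0, 6.0])   # hypothetical 1-D key
n = 3
b, a = values.min(), values.max()                    # minimum B, maximum A
edges = np.linspace(b, a, n + 1)                     # N equal value ranges
# Bucket index per record: which of the N ranges its key falls into
# (the maximum value is clipped into the last range).
bucket = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, n - 1)
samples = [values[bucket == i] for i in range(n)]    # the N samples
```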
In a specific embodiment, the classification unit 23 specifically further includes:
the category determining module is used for determining a corresponding classification item for each secondary clustering center; calculating the distance between the data information and each secondary clustering center, and distributing the data information to the classification item corresponding to the secondary clustering center with the shortest distance;
and the storage module is used for storing the obtained K2 classification items and corresponding data information in a database.
Based on the above embodiment of the method shown in fig. 1 and the device shown in fig. 2, in order to achieve the above object, an embodiment of the present application further provides a computer device, as shown in fig. 3, including a memory 32 and a processor 31, where the memory 32 and the processor 31 are both disposed on a bus 33, and the memory 32 stores a computer program, and the processor 31 implements the hard clustering algorithm-based big data classification method shown in fig. 1 when executing the computer program.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile memory (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and includes several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective implementation scenario of the present application.
Optionally, the device may also be connected to a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, etc. The user interface may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface or a wireless interface (e.g., a Bluetooth interface or Wi-Fi interface).
Based on the embodiment of the method shown in fig. 1 and the device shown in fig. 2, correspondingly, the embodiment of the application also provides a storage medium, on which a computer program is stored, which when being executed by a processor, implements the big data classification method based on the hard clustering algorithm shown in fig. 1.
It will be appreciated by those skilled in the art that the structure of a computer device provided in this embodiment is not limited to the physical device, and may include more or fewer components, or may combine certain components, or may be arranged in different components.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device and supports the execution of the information processing program and other software and/or programs. The network communication module is used to implement communication among the components within the storage medium, as well as communication with other hardware and software in the device.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general hardware platforms, or may be implemented by hardware.
By applying the technical scheme of the present application, a large amount of data information can be divided into N samples; a hard clustering algorithm is then used to perform primary cluster analysis on each sample, obtaining N×K1 primary cluster centers, and secondary cluster analysis is performed on the N×K1 primary cluster centers, obtaining K2 secondary cluster centers. The secondary cluster centers obtained in this way are more accurate, so classification based on them is more effective, and each resulting classification item has clearly distinguishable characteristics, allowing a user to tell the classification items apart without confusion.
Those skilled in the art will appreciate that the drawing is merely a schematic illustration of a preferred implementation scenario and that the modules or flows in the drawing are not necessarily required to practice the application. Those skilled in the art will appreciate that modules in an apparatus in an implementation scenario may be distributed in an apparatus in an implementation scenario according to an implementation scenario description, or that corresponding changes may be located in one or more apparatuses different from the implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above-mentioned inventive sequence numbers are merely for description and do not represent advantages or disadvantages of the implementation scenario. The foregoing disclosure is merely illustrative of some embodiments of the application, and the application is not limited thereto, as modifications may be made by those skilled in the art without departing from the scope of the application.

Claims (7)

1. The big data classification method based on the hard clustering algorithm is characterized by comprising the following steps:
acquiring data information, and dividing the data information into N pieces of sample data, wherein N is more than or equal to 1;
performing primary hard cluster analysis on each sample data to determine N x K1 primary cluster centers, wherein K1 is the number of the primary cluster centers determined after the primary hard cluster analysis of each sample data, and K1 is more than or equal to 1;
performing secondary hard cluster analysis on the N×K1 primary cluster centers to determine K2 secondary cluster centers, wherein K2 is the number of the secondary cluster centers determined in the secondary hard cluster analysis, and K2 is more than or equal to 1;
and performing secondary hard cluster analysis on the N×K1 primary cluster centers to determine K2 secondary cluster centers, wherein the method specifically comprises the following steps of:
selecting K2 centers from the N×K1 primary cluster centers as second initial cluster centers;
calculating the distance between each primary cluster center and the K2 second initial cluster centers;
distributing the primary clustering centers to second categories corresponding to second initial clustering centers with the shortest distance, wherein K2 second initial clustering centers correspond to K2 second categories;
determining a second center point for each second category, and selecting a primary clustering center with the shortest distance from the second center point as a secondary clustering center to obtain K2 secondary clustering centers;
dividing the data information into K2 classification items according to the K2 secondary clustering centers, and storing each classification item and corresponding data information in a database;
the dividing the data information into K2 classification items according to the K2 secondary clustering centers, and storing each classification item and the corresponding data information in a database, specifically including:
judging whether the number K2 of secondary clustering centers is larger than or equal to a set threshold value;
if the judgment result is yes, performing hard cluster analysis on the K2 secondary cluster centers again until the number of the determined final cluster centers is smaller than a set threshold value;
if the judgment result is negative, dividing the data information into K2 classification items according to the K2 secondary clustering centers, and storing each classification item and the corresponding data information in a database;
the dividing the data information into K2 classification items according to the K2 secondary clustering centers, and storing each classification item and the corresponding data information in a database, specifically including:
determining a corresponding classification item for each secondary cluster center;
calculating the distance between the data information and each secondary clustering center, and distributing the data information to a classification item corresponding to the secondary clustering center with the shortest distance;
and storing the obtained K2 classification items and corresponding data information in a database.
2. The big data classification method according to claim 1, wherein the performing primary hard cluster analysis on each sample data to determine n×k1 primary cluster centers specifically includes:
determining K1 first initial cluster centers for each sample data;
calculating the distance between the data information in each sample data and K1 first initial clustering centers;
distributing the data information to first categories corresponding to first initial clustering centers with shortest distances, wherein each sample data obtains K1 first categories and data information corresponding to each first category;
and determining a first center point for each first category of each sample data, and selecting the data information closest to the first center point as the primary cluster center, wherein the N samples correspond to N×K1 primary cluster centers.
3. The big data classification method according to claim 2, wherein the determining K1 first initial cluster centers for each sample data specifically includes:
randomly selecting one piece of data information C1 from each sample data, calculating the distance D(x) between each piece of data information in the sample containing C1 and C1, and, according to the formula P(x) = D(x)² / Σ_{x∈X} D(x)², calculating the probability value of each piece of data information being a first initial cluster center;
k1 data information with the probability value larger than or equal to a preset probability value is used as a first initial clustering center;
or,
setting the number K1 as the number of the first initial clustering centers;
and randomly selecting K1 data information from each sample data as a first initial clustering center.
4. The big data classification method according to claim 1, wherein the acquiring the data information divides the data information into N pieces of sample data, specifically includes:
obtaining the total number of pieces of data information, and dividing the data information evenly into N samples of a predetermined size each, wherein the last sample contains at most the predetermined number of pieces;
or,
and obtaining a maximum value A and a minimum value B of the data information, dividing the interval from B to A into N equal value ranges, and dividing the data information into N samples according to the N value ranges.
5. A big data classification device based on a hard clustering algorithm, the device comprising:
the acquisition unit is used for acquiring data information and dividing the data information into N pieces of sample data, wherein N is more than or equal to 1;
the cluster analysis unit is used for carrying out primary hard cluster analysis on each sample data to determine N x K1 primary cluster centers, wherein K1 is the number of the primary cluster centers determined after the primary hard cluster analysis of each sample data, and K1 is more than or equal to 1;
the cluster analysis unit is also used for carrying out secondary hard cluster analysis on the N X K1 primary cluster centers to determine K2 secondary cluster centers, wherein K2 is the number of the secondary cluster centers determined in the secondary hard cluster analysis, and K2 is more than or equal to 1;
the cluster analysis unit specifically includes: the center determining module is used for selecting K2 primary clustering centers from N x K1 primary clustering centers to serve as second primary clustering centers;
the distance calculation module is used for calculating the distance between each primary cluster center and the K2 second initial cluster centers;
the distribution module is used for distributing the primary clustering centers to second categories corresponding to second initial clustering centers with the shortest distance, wherein K2 second initial clustering centers correspond to K2 second categories;
the selecting module is used for determining a second center point for each second category, selecting a primary clustering center with the shortest distance from the second center point as a secondary clustering center, and obtaining K2 secondary clustering centers;
the classification unit is used for dividing the data information into K2 classification items according to the K2 secondary clustering centers and storing each classification item and the corresponding data information in a database;
the classifying unit specifically comprises: the judging module is used for judging whether the number K2 of the secondary clustering centers is larger than or equal to a set threshold value; if the judgment result is yes, performing hard cluster analysis on the K2 secondary cluster centers again until the number of the determined final cluster centers is smaller than a set threshold value; if the judgment result is negative, dividing the data information into K2 classification items according to the K2 secondary clustering centers, and storing each classification item and the corresponding data information in a database;
the classification unit specifically further comprises: the category determining module is used for determining a corresponding classification item for each secondary clustering center; calculating the distance between the data information and each secondary clustering center, and distributing the data information to the classification item corresponding to the secondary clustering center with the shortest distance;
and the storage module is used for storing the obtained K2 classification items and corresponding data information in a database.
6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the hard clustering algorithm based big data classification method according to any of claims 1 to 4 when the computer program is executed.
7. A computer storage medium having stored thereon a computer program, which when executed by a processor realizes the steps of the hard clustering algorithm based big data classification method according to any of claims 1 to 4.
CN201811044932.0A 2018-09-07 2018-09-07 Big data classification method, device and equipment based on hard clustering algorithm Active CN109447103B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811044932.0A CN109447103B (en) 2018-09-07 2018-09-07 Big data classification method, device and equipment based on hard clustering algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811044932.0A CN109447103B (en) 2018-09-07 2018-09-07 Big data classification method, device and equipment based on hard clustering algorithm

Publications (2)

Publication Number Publication Date
CN109447103A CN109447103A (en) 2019-03-08
CN109447103B true CN109447103B (en) 2023-09-29

Family

ID=65530438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811044932.0A Active CN109447103B (en) 2018-09-07 2018-09-07 Big data classification method, device and equipment based on hard clustering algorithm

Country Status (1)

Country Link
CN (1) CN109447103B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163241B (en) * 2019-03-18 2022-12-30 腾讯科技(深圳)有限公司 Data sample generation method and device, computer equipment and storage medium
CN110134839B (en) * 2019-03-27 2023-06-06 平安科技(深圳)有限公司 Time sequence data characteristic processing method and device and computer readable storage medium
CN110209260B (en) * 2019-04-26 2024-02-23 平安科技(深圳)有限公司 Power consumption abnormality detection method, device, equipment and computer readable storage medium
CN110367969A (en) * 2019-07-05 2019-10-25 复旦大学 A kind of improved electrocardiosignal K-Means Cluster

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615752A (en) * 2015-02-12 2015-05-13 北京嘀嘀无限科技发展有限公司 Information classification method and system
CN106650228A (en) * 2016-11-08 2017-05-10 浙江理工大学 Noise data removal method through improved k-means algorithm and implementation system
WO2017181660A1 (en) * 2016-04-21 2017-10-26 华为技术有限公司 K-means algorithm-based data clustering method and device
CN107367277A (en) * 2017-06-05 2017-11-21 南京邮电大学 Indoor location fingerprint positioning method based on secondary K Means clusters
CN107480708A (en) * 2017-07-31 2017-12-15 微梦创科网络科技(中国)有限公司 The clustering method and system of a kind of complex model
CN108154163A (en) * 2016-12-06 2018-06-12 北京京东尚科信息技术有限公司 Data processing method, data identification and learning method and its device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336770B (en) * 2012-02-28 2017-03-01 国际商业机器公司 Method and system for identification of complementary data object

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615752A (en) * 2015-02-12 2015-05-13 北京嘀嘀无限科技发展有限公司 Information classification method and system
WO2017181660A1 (en) * 2016-04-21 2017-10-26 华为技术有限公司 K-means algorithm-based data clustering method and device
CN106650228A (en) * 2016-11-08 2017-05-10 浙江理工大学 Noise data removal method through improved k-means algorithm and implementation system
CN108154163A (en) * 2016-12-06 2018-06-12 北京京东尚科信息技术有限公司 Data processing method, data identification and learning method and its device
CN107367277A (en) * 2017-06-05 2017-11-21 南京邮电大学 Indoor location fingerprint positioning method based on secondary K Means clusters
CN107480708A (en) * 2017-07-31 2017-12-15 微梦创科网络科技(中国)有限公司 The clustering method and system of a kind of complex model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于K-means聚类算法的改进;李金涛 等;国外电子测量技术;第36卷(第06期);第9-13页 *

Also Published As

Publication number Publication date
CN109447103A (en) 2019-03-08

Similar Documents

Publication Publication Date Title
CN109447103B (en) Big data classification method, device and equipment based on hard clustering algorithm
CN109189867B (en) Relation discovery method and device based on company knowledge graph and storage medium
CN108009915B (en) Marking method and related device for fraudulent user community
CN106355449B (en) User selection method and device
CN110909222B (en) User portrait establishing method and device based on clustering, medium and electronic equipment
CN107798102A (en) A kind of page display method and device
EP2988473A1 (en) Argument reality content screening method, apparatus, and system
CN107622326A (en) User's classification, available resources Forecasting Methodology, device and equipment
JP2016206878A (en) Salesperson raising support system and salesperson raising support method
CN110392155A (en) It has been shown that, processing method, device and the equipment of notification message
US20120218150A1 (en) Management server, population information calculation management server, non-populated area management method, and population information calculation method
CN112750030A (en) Risk pattern recognition method, risk pattern recognition device, risk pattern recognition equipment and computer readable storage medium
CN109325177B (en) Information pushing method, system, readable storage medium and computer equipment
CN109460440B (en) Image processing method, device and equipment based on weight value
CN111369284B (en) Target object type determining method and device
CN111784403A (en) User category analysis method and device based on online shopping mall and computer equipment
CN111400663A (en) Risk identification method, device, equipment and computer readable storage medium
CN111553749A (en) Activity push strategy configuration method and device
CN109727056B (en) Financial institution recommendation method, device, storage medium and device
CN107908626A (en) The computational methods and device of company's similarity
CN114565414A (en) Service area recommendation method and device
CN114706899A (en) Express delivery data sensitivity calculation method and device, storage medium and equipment
CN113158037A (en) Object-oriented information recommendation method and device
CN113674111A (en) Resource management method and device
CN110858955B (en) Crowd classification method and crowd classification device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant