CN111814882B - Data classification system based on computer big data - Google Patents

Data classification system based on computer big data

Info

Publication number
CN111814882B
CN111814882B
Authority
CN
China
Prior art keywords
data
classification
records
new
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010663513.6A
Other languages
Chinese (zh)
Other versions
CN111814882A (en)
Inventor
徐惠红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Eastern Liaoning University
Original Assignee
Eastern Liaoning University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Eastern Liaoning University
Priority to CN202010663513.6A priority Critical patent/CN111814882B/en
Publication of CN111814882A publication Critical patent/CN111814882A/en
Application granted granted Critical
Publication of CN111814882B publication Critical patent/CN111814882B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques

Abstract

The present invention is directed to methods and systems for data processing in cloud computing environments and other distributed computing environments. A merge classification suited to data classification in such environments is disclosed. It takes advantage of the large amount of parallelism available in cloud computing environments while accounting for the many constraints on data storage and data retrieval operations in those environments, thereby providing a data classification method and system that iteratively performs highly parallel merge classification operations and can be applied effectively to data sets ranging in size up to extremely large data sets.

Description

Data classification system based on computer big data
Technical Field
The present invention relates to the field of computers, and more particularly to data classification in cloud computing and other distributed computing environments, and more particularly to a data classification system based on computer big data.
Background
Many computational methods associated with data processing, data storage, and information retrieval were developed for a single processor with directly connected data storage devices and other peripherals. Data processing in such systems is sequential in nature, and as a result many classical data processing methods are inherently sequential and cannot be parallelized. With the development of computer networking and distributed computer systems, new types of computing methods have evolved to take advantage of the enormous computing bandwidth made possible by dividing and distributing computing tasks among a large number of concurrently executing processors and individual computing systems. More recently, the advent of cloud computing has again changed the fundamental constraints, capabilities, and dynamics associated with computing resources. As a result, new opportunities arise for developing new computing methods and systems implemented within cloud computing environments and other types of distributed computing environments. Although there are many different types of cloud computing facilities and environments, many cloud computing environments share certain characteristics. For example, cloud computing facilities typically provide a virtual data storage subsystem for long-term data storage, because the physical location of a user's virtual system or data center is dynamic within the cloud computing facility. Thus, in many cloud computing environments, long-term data storage is often separate from computation.
Classification is a popular technique that divides a data set into multiple categories based on similarity or dissimilarity measures in order to exploit the structure of the data set. Many classification algorithms have been proposed and widely applied to realistic situations such as image processing, text mining, medicine, and biology. However, most existing classification algorithms require a predetermined number of classes as an input parameter, for example the k-means, fuzzy c-means, k-modes, and fuzzy k-modes algorithms. In many practical applications, a suitable number of clusters is not available beforehand. This challenge motivates the need for automatic classification algorithms.
Due to the open accessibility of cloud information, cloud security is an important area of research today. Storing data in encrypted form in the cloud preserves its confidentiality. This ensures security, but introduces additional challenges when performing any operation on such data: each operation requires returning the data to the client for decryption, and the repeated exposure of the plaintext creates security risks. Furthermore, if the computation is performed at the client, the basic purpose of cloud computing is defeated. Therefore, direct processing of encrypted data is required, which is supported by homomorphic encryption schemes.
The invention provides an automatic fuzzy classification method that uses a non-dominated sorting particle swarm optimization algorithm to classify data. The proposed algorithm can automatically identify the optimal number of classes and produces both the classification result and the corresponding selected number of classes.
Disclosure of Invention
The invention provides a data classification system based on computer big data, which is realized by adopting the following technical scheme in order to solve the technical problems.
A computer big data based data classification system comprises a cloud computing facility that allows users to configure remote, virtual computing systems and data centers and to perform various types of computing tasks within these remote computer systems and data centers. The cloud computing facility provides users with virtual systems and data centers that map to actual physical servers, computers, data storage subsystems, and other remote physical data center components. In the cloud computing environment, data is stored in relatively large file objects or blocks, similar to traditional computer systems; the file objects or blocks are associated with unique identifiers and represent the fundamental units of data storage provided by the data storage subsystems within the cloud computing facility. The file objects or blocks include records, each record consisting of a key or key value, an indication of the data type of the key value, and the key value itself, and the records are ordered by a cloud merge sort operation. An object or block comprises kmax data records, and the records in the uppermost set of blocks are ordered in increasing or decreasing key order even when the records in the lowermost set of blocks are not ordered; the order of the records produced by the cloud merge sort is defined by the implementation of a relational operator used for comparing pairs of record keys during the internal operation of the cloud merge sort. A key value is moved directly from an option data structure into the current output block when it lies within the merge-sort range of key values; the next highest key value is found in one of the option data structures, and the record having the next highest key value not already stored in an option data structure is extracted from one of the input blocks. Each data record is treated as a particle. The obtained kmax is used to generate a control variable, and the control variable determines the number k of active classes in each particle. kmax is obtained by a local density identification method: the number kmax of classes for grouping the data is obtained by finding the number of data records that serve as high-density representatives. A core object for classifying the data set is selected by calculating the density of all data records and arranging them in descending order of density; the data record with the highest density is picked first and its neighbors form an atomic cluster. After the first atomic cluster is formed, the process is repeated on the remaining data records to find the remaining cluster centers and their corresponding neighbors, thereby obtaining a set of atomic clusters.
Further, the step of acquiring kmax by using the local density identification method comprises the following steps:
Let X be the set of n categorical data records with m attributes, so that each data record can be described by a set of m categorical attributes, x_i = {x_{i1}, x_{i2}, …, x_{im}}. The categorical data density is defined as follows:

Dens_l(x_i) = |{x_j ∈ X : x_{jl} = x_{il}}| / n

Dens(x_i) = (1/m) · Σ_{l=1}^{m} Dens_l(x_i)
An object x_j ∈ X is defined as a nearest neighbor of the core object x_i if its distance d_ij < d_c, where d_c is a cut-off distance. The distance between two categorical objects is calculated with the Hamming distance measure, which measures distance over binary codes: if two categorical values differ, the distance between them is 1; otherwise, for two identical features, the distance is 0. The cut-off distance is chosen so that the number of neighbors is about 1-2% of the number of data objects;
the classification compactness (pi) and the fuzzy separation (sep) are adopted as target functions, and the calculation method of the two values is as follows:
Figure BDA0002579477570000033
Figure BDA0002579477570000034
W=(Wji) Is a fuzzy membership matrix, k is kmax, Z is { Z }1,z2,…,zkIs the cluster center set, α is the weight factor, d (x)i,zj) Is the distance of object i to cluster j, d (z)j,zl) The distance from cluster j to l.
Further, the step of acquiring kmax by using the local density identification method further comprises:
the method comprises the following steps of 1, initializing a population, wherein the initialization process is to create a population containing N particles, each particle in the population is randomly initialized in a specified range, and the particles comprise two parts, namely control variables and distribution of clusters, the distribution of the clusters is based on a membership function generated by the control variables, the control variables are firstly generated to determine how many clusters are active, and the activity number h (p) of the clusters is adjusted in the initialization process to ensure that kmin is less than or equal to h (p) and less than or equal to kmax, wherein kmin is set to be 2;
step 2: set iteration t to 0, calculate all particlesThe fitness function of the child, individual best position (pbest) set to current position: pbest ═ pbestC,pbestW}={C(p),W(p)};
And step 3: and (3) increasing an iteration counter: t is t + 1;
and 4, step 4: randomly selecting a global best position (gbest) from the sorted list: gbest ═ gbest { (gbest)C,gbestW};
And 5: and (3) updating: each particle p contains a control variable and a new velocity and a new position of the cluster assignment, the new velocity and position of the control variable being updated first as follows:
V_C^{t+1}(p) = w·V_C^t(p) + c_1·r_1·(pbest_C − C^t(p)) + c_2·r_2·(gbest_C − C^t(p))

C^{t+1}(p) = C^t(p) + V_C^{t+1}(p)
where w is the inertia weight, C^t(p) and V_C^t(p) are respectively the position and velocity of particle p at iteration t, c_1 and c_2 are positive acceleration constants defining the individual and global learning coefficients, and r_1 and r_2 are two random numbers uniformly distributed in the interval [0, 1]. During the update, the control variable C(p) of particle p and its velocity V_C(p) may need to be adjusted when a value is greater than 1 or less than 0. A change in the new position of the control variable changes the number of active clusters; therefore, the number of active clusters is also updated as follows:

h_{t+1}(p) = count(C^{t+1}(p) | c_j > 0.5)
The cluster-assignment velocity is then updated to match the new number of active clusters. In addition, an intuitionistic fuzzy set (IFS) term is added to the classification-assignment velocity to increase the flexibility of the membership function, as follows:
For each particle, update the velocity to match the new number of active clusters [equation rendered as an image in the original];
Calculate the hesitation degree of the IFS [equation rendered as an image in the original], where γ is a hesitation-degree control parameter;
Update the new velocity with the corresponding function [equation rendered as an image in the original];
The new position of the cluster assignment is updated based on the new cluster-assignment velocity. However, because the number of active clusters may have changed, the size of V_W^{t+1}(p) may no longer match W^t(p); the update of W^{t+1}(p) therefore needs to be adjusted to the new size of V_W^{t+1}(p). The update procedure is as follows:
a = size(V_W^{t+1}(p));
b = size(W^t(p));
if a = b:
W^{t+1}(p) = W^t(p) + V_W^{t+1}(p);
if a < b:
W^t(p) = size_reduce(W^t(p) | a);
W^{t+1}(p) = W^t(p) + V_W^{t+1}(p);
if a > b:
W^t(p) = size_increase(W^t(p) | a);
W^{t+1}(p) = W^t(p) + V_W^{t+1}(p);
The functions size_reduce and size_increase adjust the size of the previously assigned cluster positions in W^t(p), where a and b are the sizes of V_W^{t+1}(p) and W^t(p), respectively. If a < b, W^t(p) is cut down to size a and normalized to preserve the relationships among all membership values; if b < a, the function adds more positions to W^t(p) to reach size a and generates a new fuzzy membership matrix of the corresponding size;
step 6: incorporation of the novel particle pt+1And the current pbest and stored in the nextPop _ list. nextpop _ list is a temporary fill of size 2 n;
and 7: applying a non-dominant ordering on nextPop _ list to identify and store non-dominant solutions in non-DomPSO _ list;
and 8: a new set of N particles was generated, and the next population was generated from non dompso _ list. The optimal value for the next iteration is updated from the new N particles.
And step 9: and returning to the step 3 until the termination condition is met.
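For concreteness, the following is a minimal Python sketch of the control-variable part of Step 5. It assumes the standard particle-swarm velocity and position update reconstructed above; the function name update_control_variable, the default coefficient values, and the clipping of out-of-range values into [0, 1] are illustrative assumptions rather than the patent's literal implementation.

```python
import numpy as np

def update_control_variable(C, V_C, pbest_C, gbest_C,
                            w=0.72, c1=1.49, c2=1.49, rng=None):
    """One PSO update of a particle's control variable C and its velocity V_C.

    C, V_C, pbest_C, gbest_C are 1-D arrays of length kmax with values in [0, 1].
    Returns the new velocity, the new position, and the new number of active clusters.
    """
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(C.shape), rng.random(C.shape)

    # Standard PSO velocity and position update (reconstructed form).
    V_new = w * V_C + c1 * r1 * (pbest_C - C) + c2 * r2 * (gbest_C - C)
    C_new = C + V_new

    # Values outside [0, 1] are adjusted back into range, as described in the text.
    C_new = np.clip(C_new, 0.0, 1.0)

    # A cluster j is active when its control bit exceeds 0.5.
    h_new = int(np.sum(C_new > 0.5))
    return V_new, C_new, h_new

# Example: kmax = 6 candidate clusters for one particle.
rng = np.random.default_rng(0)
C = rng.random(6)
V = np.zeros(6)
V, C, h = update_control_variable(C, V, pbest_C=C, gbest_C=rng.random(6), rng=rng)
print("active clusters:", h)
```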
Further, for each attribute l ∈ {1, …, m}, Dens(x_i) = 1/n if |{x_j ∈ X : x_{jl} = x_{il}}| = 1 for every attribute l, and conversely Dens(x_i) = 1 if |{x_j ∈ X : x_{jl} = x_{il}}| = n for every attribute l; the density of a categorical object is therefore bounded by 1/n ≤ Dens(x_i) ≤ 1.
Further, the set of blocks containing the records of the data set is divided into subsets, and each subset is allocated to a different computing resource within the cloud computing facility for classification. When the subsets have been classified, they are collected together again as a partially classified intermediate data set, and then re-divided and re-allocated among the computing resources to perform the next cloud merge classification step. This process of partitioning the data into subsets and re-collecting the sorted subsets continues iteratively until the keys or key values in each subset are sorted and do not overlap with the keys of the other subsets.
Further, the computer system includes one or more central processing units (CPUs), one or more electronic memories interconnected with the CPUs by a CPU/memory subsystem bus, and a first bridge that interconnects the CPU/memory subsystem bus with additional buses and/or other types of high-speed interconnection media, including multiple high-speed serial interconnects. These buses or serial interconnects in turn connect the CPUs and memories to dedicated processors, such as a graphics processor, and to one or more additional bridges that interconnect high-speed serial links or controllers.
Further, the cloud merge classification sorts the sequences of records contained in a set of blocks into sequential, sorted groups of records contained in the blocks. Records are sorted in ascending or descending key order in the top set of data blocks even when the records in the bottom set of data blocks are not sorted. The cloud merge classification can be used to sort records in ascending, descending, or more complex orders, where the order of the records produced by the cloud merge classification is defined by the implementation of the relational operator used for comparing pairs of record keys during the internal operations of the cloud merge classification.
Further, the cloud computing facility provides an interface that allows users to assign, start and stop virtual servers and systems within the cloud and to start specific computing tasks on the assigned servers and other virtual systems.
Further, the contiguous sequence of bytes that make up the key may be interpreted differently as a string of symbols, an integer, a floating point number, or any other such data type, or may make up a sub-record containing multiple different fields with associated values.
The cloud merge classification of the present invention is designed to be reliable under cloud computing conditions. For example, each distributed task generated in each operation of each cloud merge-sort cycle is independent, so that whenever a task fails it can simply be restarted on the same or another computing resource. The distribution of the tasks generated in each operation may be performed according to any of a number of different task distribution schemes. The invention combines an automatic fuzzy clustering method based on a non-dominated sorting particle swarm algorithm with a new local-density-based technique for determining the maximum number of classes of a data set; it can automatically identify the optimal number of clusters and optimizes the clustering using the selected number of clusters.
Drawings
FIG. 1 is a schematic flow chart of the present invention for obtaining kmax by using a local density identification method.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In modern computer systems and other processor-controlled devices and systems, the control components are typically implemented in whole or in part as sequences of computer instructions that are stored in one or more electronic memories and, in many cases, in one or more mass storage devices, and that are executed by one or more processors. As a result of their execution, a processor-controlled device or system typically performs various operations at many different levels within the device or system, according to the control logic implemented in the stored and executed computer instructions. The computer-instruction-implemented control components of processor-controlled devices and systems are as physical as any other component of the system, including the power supply, cooling fans, electronic memories, and processors.
Cloud computing facilities allow users to configure remote, virtual computing systems and data centers and perform various types of computing tasks within these remote computer systems and data centers. Typically, cloud computing facilities provide users with virtual systems and data centers that map to actual physical servers, computers, data storage subsystems, and other remote physical data center components. A user may dynamically increase computational bandwidth and data storage capacity and dynamically return unused computational bandwidth and data storage capacity to respond to dynamically changing computational loads in a cost-effective manner. Users of cloud computing facilities essentially rent the underlying physical facilities, allowing users to concentrate on developing and deploying service applications and other programs without having to worry about assembling and maintaining physical data centers, without having to purchase and maintain large computing equipment to handle peak loads, idling during off-peak periods, and incurring power, maintenance, and physical housing costs.
In a cloud computing environment, data is stored within relatively large file objects similar to those found in conventional computer systems. These objects are associated with a unique identifier that allows the objects to be reliably stored in the data storage subsystem and subsequently retrieved. Any cloud computing facility server or other computer system may be authorized to access any data object stored within the data storage subsystem of the cloud computing facility. The cloud computing facility provides an interface that allows users to assign, start and stop virtual servers and systems within the cloud and to start specific computing tasks on the assigned servers and other virtual systems.
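As a concrete picture of this storage model, the following is a minimal in-memory Python sketch of an object store addressed by unique object identifiers. The ObjectStore class and its put/get methods are illustrative assumptions standing in for the actual interface of the cloud data storage subsystem.

```python
import uuid

class ObjectStore:
    """Toy stand-in for a cloud object store: opaque blocks addressed by unique IDs."""
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        object_id = str(uuid.uuid4())      # unique identifier for the stored block
        self._blobs[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._blobs[object_id]      # any authorized server may retrieve by ID

store = ObjectStore()
block_id = store.put(b"records 1..kmax")
print(block_id, store.get(block_id))
```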
"cloud merge taxonomy" is applicable not only to cloud computing environments, but also to many other different types of distributed computing environments. A record consists of a key or key value, or rather a key and a value are two different parts of a plurality of bytes that together make up a record, the bytes that together make up the key being interpreted as the key value. For example, a contiguous sequence of bytes that constitutes a key may be interpreted differently as a string of symbols, an integer, a floating point number, or any other such data type, or may constitute a sub-record containing a number of different fields with associated values. The non-critical part of the record is the value or data value of the record.
A record can have a significantly more complex internal structure. For example, the keys and values of a record may themselves be sub-records or multi-field objects. The key includes an indication of the key size, an indication of the data type of the key value, and the key value itself; the records are sorted on this key by the cloud merge sort operation. Similarly, the value includes a value size, a data type, and the value data. The fields and subfields within a record may be of fixed or variable length, depending on the particular type of record and the record definition. For example, variable-length key values and variable-length value data may have fixed-length metadata fields that describe the overall size of the key, the data type of the key value, the overall size of the value, and the data type of the value data. The value portion of a record may include any hierarchy of sub-records with various fixed- and variable-length fields. However, for clarity and brevity, this document assumes a simple record comprising a key, on which the record is ordered, and a value consisting of one or more bytes of information; the presently disclosed method and system may be adapted to order any type of record that includes at least one key value on which the record is ordered.
The cloud merge classification of the present invention is now explained in more detail. The basic goal of the classification operation is to convert a group of records into an ordered sequence of records sorted by key value. In one embodiment, the key values are strings of symbols and the sorting operation arranges the records in dictionary or dictionary-like order. The presently disclosed cloud merge classification assumes only that key values can be ordered by a relational operator, such as the operator "<"; a simple implementation of the less-than operator takes two strings as arguments, each key representing a string of symbols. Given that a relational operator function can be implemented for the key values of a set of records, no matter how complex the key values are, cloud merge classification can be used to sort the records by key value. The sort order may also be multi-dimensional; for example, one dimension may be a last name and another dimension a first name.
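As an illustration of such a relational operator, here is a minimal Python sketch of a simple record type with a two-part key (last name, first name) and a less-than comparison usable by a merge sort. The Record class and its field names are illustrative assumptions, not the literal record layout of the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    """A simple record: a key used for ordering plus an opaque value."""
    last_name: str            # first key dimension
    first_name: str           # second key dimension
    value: bytes = b""        # non-key data carried along by the sort

    def __lt__(self, other: "Record") -> bool:
        # Dictionary order on (last_name, first_name); the value plays no role.
        return (self.last_name, self.first_name) < (other.last_name, other.first_name)

records = [
    Record("Zhao", "Min", b"..."),
    Record("Li", "Wei", b"..."),
    Record("Li", "Ang", b"..."),
]
print(sorted(records))  # sorted by last name, then first name
```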
Cloud merge classification operates on data records that are contained in larger data objects, or blocks. A data object or block represents a basic unit of data storage provided by the data storage subsystem within the cloud computing facility. For example, a data block may contain kmax data records, each comprising an integer key, such as the key "103", and a variable-length value, or data, within the record. Cloud merge sorting sorts the sequences of records contained in a set of blocks into sequential, sorted groups of records contained in the blocks. Records are sorted in ascending or descending key order in the top set of data blocks even when the records in the bottom set of data blocks are not sorted. The cloud merge classification can be used to sort records in ascending, descending, or more complex orders, where the order of the records produced by the cloud merge classification is defined by the implementation of the relational operator used for comparing pairs of record keys during the internal operations of the cloud merge classification.
The cloud computing environment includes remote physical data processing centers that provide a virtual system or virtual data center interface. Users access the cloud computing facility through personal computers, servers, or other computer systems connected to the cloud computing devices via various types of local and wide area networks, typically including the Internet. The cloud computing devices provide users with the ability to launch programs and other computing tasks on multiple server systems or other virtual computer systems, each of which may store data objects or blocks to, and retrieve them from, the data storage subsystem or object store provided within the cloud computing environment. Each data object or block includes object data and is associated with an object ID that uniquely identifies the object and serves as a tag or identifier with which the data object may be stored to, and retrieved from, the object store by a virtual server or other computing entity. The object ID may be unique at least among the virtual systems and data centers assigned to a particular task for a particular time period, and may be globally unique among all users and computing tasks associated with the cloud computing facility.
The computer system contains one or more central processing units (CPUs), one or more electronic memories interconnected with the CPUs by a CPU/memory subsystem bus, and a first bridge that interconnects the CPU/memory subsystem bus with additional buses and/or other types of high-speed interconnection media, including multiple high-speed serial interconnects. These buses or serial interconnects in turn connect the CPUs and memories to dedicated processors, such as a graphics processor, and to one or more additional bridges that interconnect high-speed serial links or controllers, such as controllers that provide access to a variety of different types of mass storage devices, electronic displays, input devices, and other such components, subcomponents, and computing resources.
Cloud merge classification begins with an entire data set that comprises a plurality of blocks, each block containing a plurality of records, which are typically unclassified at the outset. The set of blocks containing the records of the data set is divided into subsets, each of which is allocated to a different computing resource within the cloud computing facility. Thus, the initial data set is divided into data subsets, each allocated to a computing entity or resource within the cloud computing facility for classification. When the subsets have been classified, they are again collected together as a partially classified intermediate data set, and then repartitioned and reallocated among the computing resources to perform the next cloud merge classification step. This process of partitioning the data into subsets and re-collecting the sorted subsets continues iteratively until the keys or key values in each subset are sorted and do not overlap with the keys of the other subsets, except possibly at the boundaries of adjacent blocks that contain multiple records with the same key value.
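The following is a minimal Python sketch of this iterative partition-sort-merge loop, using a thread pool to stand in for the cloud computing resources. The key-range partitioning, the merge_sort_subset helper, and the stopping test are simplifying assumptions intended only to illustrate the control flow, not the patent's actual task-distribution scheme.

```python
from concurrent.futures import ThreadPoolExecutor

def merge_sort_subset(records):
    """Stand-in for one distributed task: sort one subset of records by key."""
    return sorted(records, key=lambda r: r[0])   # records are (key, value) pairs

def cloud_merge_sort(records, n_workers=4, rounds=8):
    """Iteratively partition, sort in parallel, and re-collect until sorted."""
    data = list(records)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(rounds):
            # Partition the intermediate data set into key-range subsets so that,
            # once each subset is sorted, the subsets no longer overlap in key.
            keys = sorted(r[0] for r in data)
            bounds = [keys[i * len(keys) // n_workers] for i in range(1, n_workers)]
            subsets = [[] for _ in range(n_workers)]
            for r in data:
                i = sum(r[0] >= b for b in bounds)
                subsets[i].append(r)

            # Each subset is an independent task; a failed task could simply be rerun.
            sorted_subsets = list(pool.map(merge_sort_subset, subsets))

            # Re-collect the sorted subsets into the partially classified data set.
            data = [r for subset in sorted_subsets for r in subset]

            # Stop once the whole data set is in key order.
            if all(data[i][0] <= data[i + 1][0] for i in range(len(data) - 1)):
                break
    return data

print(cloud_merge_sort([(5, "e"), (1, "a"), (3, "c"), (2, "b"), (4, "d")]))
```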
The computing flow takes advantage of the characteristics of a cloud computing facility: each step results in highly parallel sorting of many data subsets in order to achieve high computational throughput for the cloud merge sort. Because block writes and block reads may be associated with relatively large latencies in a cloud computing environment, block reads and writes are also highly parallelized, with only a relatively small number of block reads and block writes performed per task in each cycle or step. Cloud merge classification is designed to minimize the total number of steps required to fully classify an initial data set and typically requires on the order of log N steps to classify N records.
A manifest is an object in the object-oriented programming sense of the word "object"; it may also be considered a data structure. Three different blocks are shown in the left-hand column, each block comprising a plurality of records, each record consisting of a numeric key and data, and each block being associated with an object identifier or block identifier. A manifest object represents a data set, or subset of data, that includes one or more blocks of records. The manifest object includes a field indicating whether the records in the blocks associated with the manifest are ordered.
A first member function, ADD, adds the blocks of a second manifest to manifest M1 by calling the member function ADD on M1. For example, manifest M1 has two blocks containing 30 records and manifest M2 has three blocks containing 45 records; the add operation appends the blocks of the additional manifest to M1 to create a larger manifest.
Member functions associated with the manifest class convert the block partition within a manifest into a set of in-memory classification tasks that are allocated for execution on the computing resources of the cloud computing environment. After these tasks are executed, the blocks associated with each task are sorted, as each completed task performs a cloud merge sort of its blocks, and the blocks associated with the completed tasks are reassembled into a single manifest. Thus, the member function creates a set of classification tasks in memory, each task corresponding to a different block partition of the manifest and assigned for execution; after execution, the sorted blocks are reassembled into a single manifest object. Note that the manifest describes blocks stored within the data storage subsystem of the computing environment: the blocks are not stored within the manifest object but are referenced by the manifest object using their data object identifiers.
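A minimal Python sketch of such a manifest is shown below. The class name, the sorted flag, and the to_sort_tasks method are illustrative assumptions; the essential point is that the manifest references blocks only by object identifier and can be split into per-partition sort tasks and reassembled afterwards.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Manifest:
    """Describes a data (sub)set as a list of block object IDs; blocks live in object storage."""
    block_ids: List[str] = field(default_factory=list)
    is_sorted: bool = False          # whether the referenced blocks are ordered

    def add(self, other: "Manifest") -> "Manifest":
        # Appending another manifest's blocks yields a larger, unsorted manifest.
        return Manifest(self.block_ids + other.block_ids, is_sorted=False)

    def to_sort_tasks(self, blocks_per_task: int = 1) -> List["Manifest"]:
        # Partition the block list into small manifests, one per classification task.
        return [Manifest(self.block_ids[i:i + blocks_per_task])
                for i in range(0, len(self.block_ids), blocks_per_task)]

    @staticmethod
    def reassemble(parts: List["Manifest"]) -> "Manifest":
        # Recombine the sorted task outputs into a single manifest.
        merged = Manifest([bid for p in parts for bid in p.block_ids])
        merged.is_sorted = all(p.is_sorted for p in parts)
        return merged

m1 = Manifest(["blk-01", "blk-02"])
m2 = Manifest(["blk-03", "blk-04", "blk-05"])
print(m1.add(m2).to_sort_tasks(blocks_per_task=2))
```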
In a first step, the first record from each input block is read into the corresponding option data structure, and the record in the option data structures that has the lowest key value — in this case the record with key value "1" — is moved into the uppermost data structure. The input block containing the next record not yet moved into an option data structure is then identified — in this case the block whose next record has key value "7" — and that record is moved into the vacated option data structure. The key value of the record in the top data structure is then compared to the value of the BEGIN_ON data member; since the key value "1" of the record stored in the uppermost data structure is less than the value "6" stored in the BEGIN_ON data member, the record currently stored in the top data structure is discarded. In an alternative embodiment, a key value may be moved directly from an option data structure into the current output block when the key value is within the merge-sort range of key values.
The record with the next key value in ascending order is then found in one of the option data structures and moved to the top data structure, and the record with the next key value not already stored in an option data structure is extracted from one of the input blocks and moved into the vacated option data structure. Again, the record stored in the topmost data structure is discarded, because the key value "2" is less than the value "6" stored in the BEGIN_ON data member. Similar steps continue until the top-level data structure contains a record with key value "6", equal to the value stored in the BEGIN_ON data member; in this case the record is written from the top data structure to the current output block as its first record.
Again, in each record-move operation, the record with the lowest key among the option data structures is moved to the topmost data structure, the record with the next key value in order is extracted from one of the input blocks, and the option data structure from which a record was removed is refilled with that newly extracted record. Now, however, the record is moved from the top data structure into the current output block rather than being discarded, because the record's key value is greater than the value stored in the BEGIN_ON data member. Thus, the record with the next key value, "7", becomes the second record in the initial output block. As the process continues, the current output block is eventually filled.
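The per-task merge described above is essentially a k-way merge with a lower bound on the keys that the task is responsible for emitting. The sketch below is a minimal Python version using a heap in place of the option data structures; the begin_on and end_before parameters and the use of heapq are illustrative assumptions about one way to realize the described behavior.

```python
import heapq

def kway_merge_range(input_blocks, begin_on, end_before=None, block_size=4):
    """Merge several sorted input blocks of (key, value) records, emitting only
    records whose keys fall in [begin_on, end_before), packed into output blocks."""
    heap = []  # plays the role of the option data structures: one candidate per block
    iters = [iter(b) for b in input_blocks]
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], i, first))

    output_blocks, current = [], []
    while heap:
        key, i, record = heapq.heappop(heap)          # record with the lowest key
        nxt = next(iters[i], None)                    # refill from the same input block
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], i, nxt))
        if key < begin_on:                            # below the task's range: discard
            continue
        if end_before is not None and key >= end_before:
            break                                     # past the range: this task is done
        current.append(record)
        if len(current) == block_size:                # current output block is full
            output_blocks.append(current)
            current = []
    if current:
        output_blocks.append(current)
    return output_blocks

blocks = [[(1, "a"), (6, "f"), (9, "i")], [(2, "b"), (7, "g")], [(6, "x"), (8, "h")]]
print(kway_merge_range(blocks, begin_on=6))
```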
A single manifest representing all data records is divided into data subsets associated with tasks distributed in the computing resources of the cloud computing facility. The data subsets are sorted by a merge-sort operation and then recombined into a single data subset represented by a single manifest object.
Regarding how to determine kmax for the data records in each object or block: the conventional approach is to try different numbers of classes (k) in the range [kmin, kmax] and evaluate the classification results using an internal classification validation index (CVI); the optimal number of classes (optimal k) is then selected according to the CVI criterion. Merge-split based approaches center around k-means, expectation maximization, and statistical criteria. The invention proposes a new automatic fuzzy classification algorithm based on non-dominated sorting particle swarm optimization for classifying data, together with a local-density-based method for identifying the maximum cluster number kmax, instead of using the empirical rule in which kmax is taken as the square root of the number of objects in the data set. The obtained kmax is used to generate a control variable, which determines the number of active clusters k in each particle, each data record being considered as a particle. Each particle is designed in two parts: 1) a control variable identifying the number of classes k, and 2) a classification assignment dividing the data. The particles are evaluated with two objective functions, global compactness and fuzzy separation. The iterative process updates the velocity and position of each particle in the swarm; during the update, adjustment strategies are applied to the control variables and cluster assignments to avoid infeasible solutions.
Identifying kmax plays an important role in the algorithm. Most existing classification algorithms adopt the empirical rule kmax = √n, where n is the number of data records. However, this rule is not very reliable, as there is no theory indicating a relationship between the number of instances in a data set and the number of clusters. Furthermore, using the empirical rule to identify kmax has drawbacks: for large data sets kmax becomes too large, increasing the time complexity, and when the true number of clusters exceeds √n, the resulting kmax is unacceptable. The invention therefore provides a kmax identification method based on local density. In general, a cluster centroid is an object with a higher local density, surrounded by neighboring objects with lower local density. This means that if we can find the number of representatives with high density, the number of classes for grouping the data can be obtained.
Let X be the set of n categorical data records with m attributes, so that each data record can be described by a set of m categorical attributes, x_i = {x_{i1}, x_{i2}, …, x_{im}}. The categorical data density is defined as follows:

Dens_l(x_i) = |{x_j ∈ X : x_{jl} = x_{il}}| / n

Dens(x_i) = (1/m) · Σ_{l=1}^{m} Dens_l(x_i)
Next, we determine the upper and lower bounds of the object density. For each attribute l ∈ {1, …, m}, Dens(x_i) = 1/n if |{x_j ∈ X : x_{jl} = x_{il}}| = 1 for every attribute, and conversely Dens(x_i) = 1 if |{x_j ∈ X : x_{jl} = x_{il}}| = n for every attribute; the density of a categorical object is therefore bounded by 1/n ≤ Dens(x_i) ≤ 1. However, Dens(x_i) = 1 is very rare, since it means that all objects are completely similar.
Obviously, the higher the density of an object, the more objects surround it and the greater the probability that it is selected as a cluster center. Selecting the core data records for a categorical data set requires calculating the density of all data records. The method sorts all objects in descending order of density; the densest object is picked first and its neighbors are found to form an atomic cluster. After the first atomic cluster is formed, the process is repeated on the remaining objects to find the remaining cluster centers and their corresponding neighbors, resulting in a collection of atomic clusters. kmax is then defined as the number of atomic clusters.
To form an atomic cluster, the nearest-neighbor objects of a core object must be identified from among the non-core objects, using the distance from the non-core object to the core object. An object x_j ∈ X is defined as a nearest neighbor of the core object x_i if its distance d_ij < d_c, where d_c is the cut-off distance. The distance between two categorical objects is calculated with the Hamming distance measure, which measures distance over binary codes: if two categorical values differ, the distance between them is 1; otherwise, for two identical features, the distance is 0. The cut-off distance is chosen so that the number of neighbors is about 1-2% of the number of data objects.
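Putting the density definition, the Hamming distance, and the cut-off rule together, a minimal Python sketch of the local-density kmax estimate might look as follows. The per-attribute frequency density and the greedy center-picking loop follow the description above; the exact cut-off value, the tie-breaking, and the neighbor fraction of 1.5% are illustrative assumptions.

```python
from collections import Counter

def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

def estimate_kmax(X, neighbor_fraction=0.015):
    """Estimate kmax as the number of high-density 'atomic cluster' centers."""
    n, m = len(X), len(X[0])
    # Per-attribute value frequencies give the categorical density of each record.
    freq = [Counter(row[l] for row in X) for l in range(m)]
    dens = [sum(freq[l][x[l]] / n for l in range(m)) / m for x in X]   # 1/n <= dens <= 1

    # Illustrative cut-off distance (a fraction of the attributes) and neighbor budget (~1-2% of n).
    d_c = max(1, round(m * 0.25))
    target = max(1, int(neighbor_fraction * n))

    order = sorted(range(n), key=lambda i: dens[i], reverse=True)
    unassigned = set(order)
    centers = []
    for i in order:                           # densest remaining object becomes a center
        if i not in unassigned:
            continue
        centers.append(i)
        neighbors = [j for j in unassigned
                     if j != i and hamming(X[i], X[j]) < d_c][:target]
        unassigned -= {i, *neighbors}         # the center and its neighbors form one atomic cluster
    return len(centers)

X = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y"), ("b", "y"), ("c", "z")]
print(estimate_kmax(X))
```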
The population of particles is represented by fuzzy matrices of size k × (1 + n), where n is the number of data instances and k equals kmax. The first column of a particle holds the control variables of the clusters, which determine how many clusters should be used for a given data set. The control variables are randomly generated between 0 and 1. If a value is greater than 0.5, the corresponding class exists, and objects are assigned to that class through a fuzzy membership function; when the control variable is less than 0.5, the corresponding class is inactive and its fuzzy membership values are zero. The classification assignment is the fuzzy membership matrix W = (w_ji), j = 1, …, k, as shown in Table 1. [Tables 1 and 2 appear as images in the original.]
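A minimal Python sketch of this particle representation and its initialization is given below. The Particle container, the column-wise normalization of the membership matrix, and the re-activation of clusters when fewer than kmin are active are assumptions consistent with the description (kmin = 2).

```python
import numpy as np

def init_particle(n, kmax, kmin=2, rng=None):
    """One particle: control variables (length kmax) plus a kmax x n fuzzy membership matrix."""
    rng = rng or np.random.default_rng()
    control = rng.random(kmax)                      # first "column": cluster control variables

    # Ensure kmin <= h(p) <= kmax by forcing bits above 0.5 if too few clusters are active.
    active = control > 0.5
    while active.sum() < kmin:
        j = rng.integers(kmax)
        control[j] = rng.uniform(0.5, 1.0)
        active = control > 0.5

    # Fuzzy membership matrix W: rows of inactive clusters are zero,
    # columns over active clusters sum to one.
    W = np.zeros((kmax, n))
    W[active] = rng.random((active.sum(), n))
    W[active] /= W[active].sum(axis=0, keepdims=True)
    return {"control": control, "W": W, "h": int(active.sum())}

particle = init_particle(n=10, kmax=5)
print(particle["h"], particle["W"].sum(axis=0))     # memberships sum to 1 per object
```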
The algorithm uses the classification compactness (π) and the fuzzy separation (sep) as objective functions. The two values are computed from the following quantities [the defining equations appear as images in the original]: W = (w_ji) is the fuzzy membership matrix, k = kmax, Z = {z_1, z_2, …, z_k} is the set of cluster centers, α is a weighting factor, d(x_i, z_j) is the distance from object i to cluster j, and d(z_j, z_l) is the distance from cluster j to cluster l.
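The exact formulas for π and sep are rendered only as images in the original, so the sketch below uses one plausible pair of definitions — membership-weighted within-cluster distance for compactness and between-center distance for separation — purely to illustrate how the two objectives would be evaluated for a particle. These formulas are assumptions, not the patent's equations.

```python
import numpy as np

def objectives(W, centers, X, distance, alpha=2.0):
    """Evaluate (compactness, separation) for one particle.

    W        : (k, n) fuzzy membership matrix (rows of inactive clusters are zero)
    centers  : list of k cluster centers
    X        : list of n data records
    distance : callable d(a, b), e.g. a Hamming distance for categorical data
    """
    k, n = W.shape
    # Assumed compactness: membership-weighted average distance of objects to their centers.
    num = sum(W[j, i] ** alpha * distance(X[i], centers[j])
              for j in range(k) for i in range(n))
    den = max(W.sum(), 1e-12)
    compactness = num / den

    # Assumed separation: sum of pairwise distances between distinct cluster centers.
    separation = sum(distance(centers[j], centers[l])
                     for j in range(k) for l in range(k) if l != j)
    return compactness, separation   # minimize compactness, maximize separation

def hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

X = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "z")]
W = np.array([[0.9, 0.7, 0.2, 0.1],
              [0.1, 0.3, 0.8, 0.9]])
print(objectives(W, centers=[("a", "x"), ("b", "y")], X=X, distance=hamming))
```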
Step 1, population initialization. The initialization process will create a population of N particles. Each particle in the population is randomly initialized within a specified range. A particle consists of two parts, a control variable and the allocation of clusters. The assignment of clusters is a membership function generated based on the control variables. Thus, control variables are first generated to determine how many clusters are active.
The number of active clusters h(p) should be adjusted during initialization to ensure kmin ≤ h(p) ≤ kmax, where kmin is set to 2. Therefore, if all bits of the control variable of a particle C(p) are less than 0.5, h(p) = kmin clusters are selected to be active.
Step 2: set the iteration t = 0 and calculate the fitness functions of all particles. The individual best position (pbest) is set to the current position: pbest = {pbest_C, pbest_W} = {C(p), W(p)}.
Step 3: increase the iteration counter: t = t + 1.
Step 4: randomly select a global best position (gbest) from the sorted list: gbest = {gbest_C, gbest_W}.
Step 5: update. Each particle p contains a control variable and a cluster assignment, each with a new velocity and a new position. The new velocity and position of the control variable are updated first, as follows:

V_C^{t+1}(p) = w·V_C^t(p) + c_1·r_1·(pbest_C − C^t(p)) + c_2·r_2·(gbest_C − C^t(p))

C^{t+1}(p) = C^t(p) + V_C^{t+1}(p)
where w is the inertia weight, C^t(p) and V_C^t(p) are respectively the position and velocity of particle p at iteration t, c_1 and c_2 are positive acceleration constants defining the individual and global learning coefficients, and r_1 and r_2 are two random numbers uniformly distributed in the interval [0, 1]. During the update, the control variable C(p) of particle p and its velocity V_C(p) may need to be adjusted when a value is greater than 1 or less than 0; in the present invention it is adjusted to 1 or 0. The new position of the control variable changes, which results in a change in the number of active clusters. Thus, the number of active clusters is also updated, as follows:

h_{t+1}(p) = count(C^{t+1}(p) | c_j > 0.5)
The cluster-assignment velocity is then updated to match the new number of active clusters. In addition, this embodiment adds an intuitionistic fuzzy set (IFS) term to the classification-assignment velocity to increase the flexibility of the membership function, as follows (a sketch of this update appears after Step 9):
For each particle, update the velocity to match the new number of active clusters [equation rendered as an image in the original];
Calculate the hesitation degree of the IFS [equation rendered as an image in the original], where γ is a hesitation-degree control parameter;
Update the new velocity with the corresponding function [equation rendered as an image in the original];
The new position of the cluster assignment is updated based on the new cluster-assignment velocity. However, because the number of active clusters may have changed, the size of V_W^{t+1}(p) may no longer match W^t(p); the update of W^{t+1}(p) therefore needs to be adjusted to the new size of V_W^{t+1}(p). The update procedure is as follows:
a = size(V_W^{t+1}(p));
b = size(W^t(p));
if a = b:
W^{t+1}(p) = W^t(p) + V_W^{t+1}(p);
if a < b:
W^t(p) = size_reduce(W^t(p) | a);
W^{t+1}(p) = W^t(p) + V_W^{t+1}(p);
if a > b:
W^t(p) = size_increase(W^t(p) | a);
W^{t+1}(p) = W^t(p) + V_W^{t+1}(p);
It can thus be seen that a function is used to adjust the size of the previously assigned cluster positions in W^t(p), where a and b are the sizes of V_W^{t+1}(p) and W^t(p), respectively. If a < b, W^t(p) is cut down to size a and normalized to preserve the relationships among all membership values; if b < a, the function adds more positions to W^t(p) to reach size a and generates a new fuzzy membership matrix of the corresponding size.
As with the update of the control variable in the particle, the new position W^{t+1}(p) may exceed the allowed membership values; the final W^{t+1}(p) is therefore normalized to ensure valid membership values.
Step 6: merge the new particle p^{t+1} with the current pbest and store the result in nextPop_list, a temporary list of size 2N.
Step 7: apply non-dominated sorting on nextPop_list to identify the non-dominated solutions and store them in nonDomPSO_list.
Step 8: generate a new set of N particles; the next population is generated from nonDomPSO_list, and the best values for the next iteration are updated from the new N particles.
Step 9: return to Step 3 until the termination condition is met.
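Referring back to Step 5, the following Python sketch illustrates the cluster-assignment update with the size adjustment described above. The particular choices inside size_reduce (keeping the strongest rows) and size_increase (appending random rows), the clipping of the result to non-negative values, and the column-wise normalization are assumptions consistent with the text; the IFS hesitation term is omitted because its exact formula appears only as an image in the original.

```python
import numpy as np

def size_reduce(W, a):
    """Cut W down to its a strongest rows, then renormalize columns to sum to 1."""
    keep = np.argsort(W.sum(axis=1))[-a:]          # assumption: keep the strongest clusters
    W = W[np.sort(keep)]
    return W / np.maximum(W.sum(axis=0, keepdims=True), 1e-12)

def size_increase(W, a, rng=None):
    """Add rows to W until it has a rows, then renormalize columns to sum to 1."""
    rng = rng or np.random.default_rng()
    extra = rng.random((a - W.shape[0], W.shape[1]))
    W = np.vstack([W, extra])
    return W / np.maximum(W.sum(axis=0, keepdims=True), 1e-12)

def update_assignment(W_t, V_W_new):
    """W^{t+1}(p) = adjusted W^t(p) + V_W^{t+1}(p), followed by normalization."""
    a, b = V_W_new.shape[0], W_t.shape[0]
    if a < b:
        W_t = size_reduce(W_t, a)
    elif a > b:
        W_t = size_increase(W_t, a)
    W_new = np.clip(W_t + V_W_new, 0.0, None)      # keep memberships non-negative
    return W_new / np.maximum(W_new.sum(axis=0, keepdims=True), 1e-12)

W_t = np.array([[0.6, 0.2, 0.1],
                [0.3, 0.5, 0.2],
                [0.1, 0.3, 0.7]])
V_W_new = np.zeros((2, 3))                          # the number of active clusters dropped to 2
print(update_assignment(W_t, V_W_new))
```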
The cloud merge classification of the present invention is designed to be reliable under cloud computing conditions. For example, each distributed task generated in each operation of each cloud merge-sort cycle is independent, so that whenever a task fails it can simply be restarted on the same or another computing resource. The distribution of the tasks generated in each operation may be performed according to any of a number of different task distribution schemes. The invention combines an automatic fuzzy clustering method based on a non-dominated sorting particle swarm algorithm with a new local-density-based technique for determining the maximum number of classes of a data set; it can automatically identify the optimal number of clusters and optimizes the clustering using the selected number of clusters.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. A computer big data based data classification system, characterized in that it comprises a cloud computing facility allowing users to configure remote, virtual computing systems and data centers and to perform various types of computing tasks within these remote computer systems and data centers, the cloud computing facility providing users with virtual systems and data centers mapped to actual physical servers, computers, data storage subsystems and other remote physical data center components; in the cloud computing environment data is stored in relatively large file objects or blocks associated with unique identifiers, the file objects or blocks representing the basic units of data storage provided by the data storage subsystems within the cloud computing facility, the file objects or blocks including corresponding records, each record consisting of a key or key value, the key comprising an indication of the data type of the key value and the key value, whereby the records are ordered by a cloud merge sort operation; the file object or block comprises kmax data records, ordered in increasing or decreasing key order even when the records in the lowest set of blocks are not ordered, wherein the order of records produced by the cloud merge sort is defined by the implementation of a relational operator for comparing pairs of record keys during the cloud merge sort internal operation, a key value being moved directly from an option data structure into the current output block when the key value is within the merge-sort range of key values, the next highest key value being found in one of the option data structures, and the record having the next highest key value not already stored in an option data structure being extracted from one of the input blocks; each data record is considered as a particle, the obtained kmax is used to generate a control variable, the control variable determines the number k of active classes in each particle, where k is at most kmax, kmax is obtained by a local density identification method, the number of classes for grouping the data is obtained by finding the number of data records with high-density representatives, a core object for classifying the data set is selected by calculating the density of all data records and arranging all data records in descending order of density, the data record with the highest density is picked up and its neighbors form a cluster, and after the first cluster is formed the process is repeated on the remaining records to find the remaining cluster centers and their corresponding neighbors, thereby obtaining a set of atomic clusters;
the method for acquiring kmax by adopting the local density identification method comprises the following steps:
let X be a set of n categorical data records having m attributes, each data record being describable by a set of m categorical attributes, so that x_i = {x_{i1}, x_{i2}, …, x_{im}}; the categorical data density is defined as follows:

Dens_l(x_i) = |{x_j ∈ X : x_{jl} = x_{il}}| / n

Dens(x_i) = (1/m) · Σ_{l=1}^{m} Dens_l(x_i)
an object x_j ∈ X is defined as a nearest neighbor of the core object x_i if its distance d_ij < d_c, wherein d_c is a cut-off distance, the distance between two categorical objects is calculated with the Hamming distance measure, which measures distance over binary codes, such that if two categorical values differ the distance between them is 1 and otherwise, for two identical features, the distance is 0, and the cut-off distance is chosen so that the number of neighbors is 1-2% of the number of data objects;
the classification compactness π and the fuzzy separation sep are adopted as objective functions, the two values being computed from the following quantities [the defining equations appear as images in the original]: W = (w_ji) is the fuzzy membership matrix, k = kmax, Z = {z_1, z_2, …, z_k} is the set of cluster centers, α is a weighting factor, d(x_i, z_j) is the distance from object i to cluster j, and d(z_j, z_l) is the distance from cluster j to cluster l.
2. The system for classifying data based on big computer data according to claim 1, wherein the step of obtaining kmax by using the local density identification method further comprises:
step 1, population initialization, wherein the initialization process creates a population containing N particles, each particle in the population being randomly initialized within a specified range, the particles comprising two parts, a control variable and a cluster assignment, the cluster assignment being a membership function generated from the control variable, the control variable being generated first to determine how many clusters are active, and the number of active clusters h(p) being adjusted during initialization to ensure kmin ≤ h(p) ≤ kmax, wherein kmin is set to 2;
step 2, setting the iteration t = 0, calculating the fitness functions of all particles, and setting the individual best position pbest to the current position: pbest = {pbest_C, pbest_W} = {C(p), W(p)};
step 3, increasing the iteration counter: t = t + 1;
step 4, randomly selecting a global best position gbest from the sorted list: gbest = {gbest_C, gbest_W};
step 5, the updating process, wherein each particle p comprises a control variable and a cluster assignment, each with a new velocity and a new position, the new velocity and position of the control variable being updated first as follows:

V_C^{t+1}(p) = w·V_C^t(p) + c_1·r_1·(pbest_C − C^t(p)) + c_2·r_2·(gbest_C − C^t(p))

C^{t+1}(p) = C^t(p) + V_C^{t+1}(p)
where w is the inertia weight, C^t(p) and V_C^t(p) are respectively the position and velocity of particle p at iteration t, c_1 and c_2 are positive acceleration constants defining the individual and global learning coefficients, and r_1 and r_2 are two random numbers uniformly distributed in the interval [0, 1]; during the update, the control variable C(p) of particle p and its velocity V_C(p) may need to be adjusted because a value is greater than 1 or less than 0, and a change in the new position of the control variable results in a change in the number of active clusters, so the number of active clusters is also updated as follows:

h_{t+1}(p) = count(C^{t+1}(p) | c_j > 0.5)
the cluster-assignment velocity is then updated to match the new number of active clusters, and an intuitionistic fuzzy set (IFS) term is added to the classification-assignment velocity to increase the flexibility of the membership function, specifically as follows:
for each particle, the velocity is updated to match the new number of active clusters [equation rendered as an image in the original];
the hesitation degree of the IFS is calculated [equation rendered as an image in the original], where γ is a hesitation-degree control parameter;
the new velocity is updated with the corresponding function [equation rendered as an image in the original];
the new position of the cluster assignment is updated based on the new cluster-assignment velocity, and, because the number of active clusters may have changed, the size of V_W^{t+1}(p) may no longer match W^t(p); the update of W^{t+1}(p) therefore needs to be adjusted to the new size of V_W^{t+1}(p), the update process being as follows:
a = size(V_W^{t+1}(p));
b = size(W^t(p));
if a = b:
W^{t+1}(p) = W^t(p) + V_W^{t+1}(p);
if a < b:
W^t(p) = size_reduce(W^t(p) | a);
W^{t+1}(p) = W^t(p) + V_W^{t+1}(p);
if a > b:
W^t(p) = size_increase(W^t(p) | a);
W^{t+1}(p) = W^t(p) + V_W^{t+1}(p);
a function is used to adjust the size of the previously assigned cluster positions in W^t(p), wherein a and b are the sizes of V_W^{t+1}(p) and W^t(p), respectively; if a < b, W^t(p) is cut down to size a and normalized to maintain the relationship between all membership values, and if b < a, the function adds more positions to W^t(p) to obtain size a and generates a new fuzzy membership matrix of the corresponding size;
step 6, merging the new particle p^{t+1} with the current pbest and storing the result in nextPop_list, which is a temporary list of size 2N;
step 7, applying non-dominated sorting on nextPop_list to identify the non-dominated solutions and store them in nonDomPSO_list;
step 8, generating a new set of N particles, generating the next population from nonDomPSO_list, and updating the best values for the next iteration from the new N particles;
step 9, returning to step 3 until the termination condition is met.
3. A computer big data based data classification system according to any of claims 1-2, characterized in that for each attribute l ∈ {1, …, m}, Dens(x_i) = 1/n if |{x_j ∈ X : x_{jl} = x_{il}}| = 1 for every attribute, and conversely Dens(x_i) = 1 if |{x_j ∈ X : x_{jl} = x_{il}}| = n for every attribute, the density of a categorical object therefore being bounded by 1/n ≤ Dens(x_i) ≤ 1.
4. A computer big data based data classification system according to any of claims 1-2, characterized in that the set of blocks containing the records of a data set is divided into subsets, each subset being allocated to a different computing resource within the cloud computing facility for classification; when the subsets have been classified, they are collected together again as a partially classified intermediate data set and then subdivided and reallocated among the computing resources to perform the next cloud merge classification step, and the process of partitioning the data into subsets and re-collecting the sorted data subsets continues iteratively until the keys or key values in each subset are sorted and do not overlap with the keys of the other subsets.
5. A computer big data based data classification system according to claim 1, wherein the computer system contains one or more central processing units (CPUs); one or more electronic memories interconnected with the CPUs by a CPU/memory-subsystem bus; a first bridge interconnecting the CPU/memory-subsystem bus with additional buses and/or other types of high-speed interconnection media, including multiple high-speed serial interconnects, which in turn connect the CPUs and memories with specialized processors such as graphics processors; and one or more additional bridges interconnecting the high-speed serial links or multiple controllers.
6. The computer big data based data classification system according to claim 1, wherein the cloud merge classification classifies a sequence of records contained in a set of blocks into sequential groups of classified records contained in that set of blocks, the records in the bottom set of blocks being unsorted while the records in the top set of data blocks are sorted in ascending key order; the cloud merge classification is operable to classify the records in ascending, descending, or more complex orders, wherein the order of the records resulting from the cloud merge classification is defined by the implementation of the relational operator used to compare pairs of record keys during the internal operations of the cloud merge classification.
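Claim 6 makes the record order a consequence of the relational operator used when comparing pairs of record keys. A small illustration of how swapping that comparison switches between ascending, descending, and a more complex composite order (the record layout and field names are invented for the example):

```python
from functools import cmp_to_key

records = [{"key": ("smith", 3)}, {"key": ("adams", 7)}, {"key": ("smith", 1)}]

def ascending(a, b):
    """Relational operator used inside the merge: plain '<' on the whole key."""
    return (a["key"] > b["key"]) - (a["key"] < b["key"])

def composite(a, b):
    """A more complex order: name ascending, then the numeric field descending."""
    if a["key"][0] != b["key"][0]:
        return -1 if a["key"][0] < b["key"][0] else 1
    return (b["key"][1] > a["key"][1]) - (b["key"][1] < a["key"][1])

print(sorted(records, key=cmp_to_key(ascending)))                 # ascending key order
print(sorted(records, key=cmp_to_key(ascending), reverse=True))   # descending key order
print(sorted(records, key=cmp_to_key(composite)))                 # composite order
```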
7. A computer big data based data classification system according to claim 3, characterized in that the cloud computing facility provides an interface that allows users to allocate, start, and stop virtual servers and systems within the cloud, and to launch specific computing tasks on the allocated servers and other virtual systems.
8. A computer big data based data classification system according to claim 3, characterized in that the consecutive byte sequence constituting a key can be variously interpreted as a string of symbols, an integer, a floating-point number, or any other such data type, or can constitute a sub-record containing a number of different fields with associated values.
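Claim 8's point that one consecutive byte sequence admits several interpretations can be shown with Python's struct module; the eight-byte key and the field layout below are invented for the illustration:

```python
import struct

key_bytes = b"\x00\x00\x80\x3f\x2a\x00\x00\x00"   # eight bytes taken from a record key

# the same bytes, interpreted three different ways
as_symbols = key_bytes.hex()                    # symbol string
as_integers = struct.unpack("<2i", key_bytes)   # two little-endian 32-bit integers
as_subrecord = struct.unpack("<fi", key_bytes)  # sub-record: a float field and an int field

print(as_symbols)     # '0000803f2a000000'
print(as_integers)    # (1065353216, 42)
print(as_subrecord)   # (1.0, 42)
```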
CN202010663513.6A 2020-07-10 2020-07-10 Data classification system based on computer big data Active CN111814882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010663513.6A CN111814882B (en) 2020-07-10 2020-07-10 Data classification system based on computer big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010663513.6A CN111814882B (en) 2020-07-10 2020-07-10 Data classification system based on computer big data

Publications (2)

Publication Number Publication Date
CN111814882A CN111814882A (en) 2020-10-23
CN111814882B true CN111814882B (en) 2021-06-22

Family

ID=72843017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010663513.6A Active CN111814882B (en) 2020-07-10 2020-07-10 Data classification system based on computer big data

Country Status (1)

Country Link
CN (1) CN111814882B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737126A (en) * 2012-06-19 2012-10-17 合肥工业大学 Classification rule mining method under cloud computing environment
CN104699772A (en) * 2015-03-05 2015-06-10 孟海东 Big data text classifying method based on cloud computing
CN108039962A (en) * 2017-12-05 2018-05-15 三盟科技股份有限公司 A kind of cloud calculation service Resource Calculation method and system under the environment based on big data
CN108255935A (en) * 2017-11-29 2018-07-06 江苏摇铃网络科技有限公司 A kind of method based on data mining algorithm analysis big data
CN109460829A (en) * 2018-11-05 2019-03-12 常熟理工学院 Based on the intelligent monitoring method and platform under big data processing and cloud transmission

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013184975A2 (en) * 2012-06-06 2013-12-12 Spiral Genetics Inc. Method and system for sorting data in a cloud-computing environment and other distributed computing environments

Also Published As

Publication number Publication date
CN111814882A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
US11574202B1 (en) Data mining technique with distributed novelty search
Li et al. Clustering for approximate similarity search in high-dimensional spaces
US20130151535A1 (en) Distributed indexing of data
US20080082531A1 (en) Clustering system and method
Junaid et al. Modeling an optimized approach for load balancing in cloud
WO2016162338A1 (en) Load balancing for large in-memory databases
CN108833302B (en) Resource allocation method based on fuzzy clustering and strict bilateral matching in cloud environment
Chhikara et al. An efficient container management scheme for resource-constrained intelligent IoT devices
WO2022007596A1 (en) Image retrieval system, method and apparatus
CN110019017B (en) High-energy physical file storage method based on access characteristics
Guo et al. A resource aware MapReduce based parallel SVM for large scale image classifications
He Evolutionary K-Means with pair-wise constraints
CN116680090B (en) Edge computing network management method and platform based on big data
Liu et al. A MapReduce based distributed LSI for scalable information retrieval
CN111814882B (en) Data classification system based on computer big data
Elmeiligy et al. An efficient parallel indexing structure for multi-dimensional big data using spark
CN109784354A (en) Based on the non-parametric clustering method and electronic equipment for improving classification effectiveness
JP2021179859A (en) Learning model generation system and learning model generation method
CN112215655A (en) Client portrait label management method and system
Antaris et al. Similarity search over the cloud based on image descriptors' dimensions value cardinalities
Eren et al. A K-means algorithm application on big data
Li et al. A fast approach of provisioning virtual machines by using image content similarity in cloud
Nadaf et al. Performance evaluation of categorizing technical support requests using advanced K-means algorithm
Tareq et al. A new density-based method for clustering data stream using genetic algorithm
Xiong et al. Research on MapReduce parallel optimization method based on improved K-means clustering algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant