CN118069359A - Data processing method, device, computer equipment and storage medium

Data processing method, device, computer equipment and storage medium

Info

Publication number
CN118069359A
Authority
CN
China
Prior art keywords
data
child node
node
fragments
utilization rate
Prior art date
Legal status
Pending
Application number
CN202410234031.7A
Other languages
Chinese (zh)
Inventor
张季冬
贾伟
王晓通
Current Assignee
Jinan Inspur Data Technology Co Ltd
Original Assignee
Jinan Inspur Data Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jinan Inspur Data Technology Co Ltd filed Critical Jinan Inspur Data Technology Co Ltd
Priority to CN202410234031.7A priority Critical patent/CN118069359A/en
Publication of CN118069359A publication Critical patent/CN118069359A/en
Pending legal-status Critical Current

Abstract

The invention relates to the technical field of data processing, and discloses a data processing method, a device, computer equipment and a storage medium. The data processing method is applied to a master node in a distributed computing system, where the distributed computing system comprises the master node and a plurality of child nodes, and the method comprises the following steps: performing slicing processing on the data in a first data set to obtain a plurality of data fragments; determining the number of data fragments currently allocated to each child node according to the load condition when each child node processed the last allocated data fragments; sending the corresponding number of data fragments to each child node according to the number currently allocated to it, so that each child node processes the data in its data fragments; receiving the processed data fragments from each child node; and assembling the data in the plurality of processed data fragments to obtain a second data set. The invention can ensure the load balance of the computing nodes and improve data processing efficiency.

Description

Data processing method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method, a data processing device, a computer device, and a storage medium.
Background
In the process of data collection and storage, problems such as errors, omissions, duplication, or inconsistency often arise in the data due to human factors, equipment faults, transmission errors, and the like, for example data entry errors, non-uniform data formats, and missing data fields, so data processing (such as data cleaning) becomes important. Data cleaning is an important link in data preprocessing: the original data is screened, converted, corrected, deleted, and so on, in order to remove errors, inconsistencies, and duplicates in the data, thereby ensuring data quality and accuracy.
Currently, in order to increase processing speed and improve efficiency, a distributed computing system divides data processing into two stages: in the first stage (the Map stage), the data is fragmented and distributed to a plurality of computing nodes for parallel processing, and in the second stage (the Reduce stage), the processing results of the computing nodes are combined to obtain the final cleaning result. However, the above data-fragment allocation process suffers from unbalanced load. Specifically, different fragments may contain different amounts of data, and some data cleaning operations (such as removing outliers or handling missing values) may consume different amounts of computing resources and time. If the load across data fragments is unbalanced, some fragments take much longer to clean than others, which affects the efficiency and quality of the whole data cleaning process.
Disclosure of Invention
In view of the above, the present invention provides a data processing method, apparatus, computer device and storage medium, so as to solve the problem of unbalanced load of sub-nodes in the data processing process.
In a first aspect, the present invention provides a data processing method applied to a master node in a distributed computing system, the distributed computing system including the master node and a plurality of child nodes, the method comprising: performing slicing processing on the data in the first data set to obtain a plurality of data slices; determining the number of data fragments currently allocated to each sub-node according to the load condition when each sub-node processes the last allocated data fragment, wherein the load condition is represented by the utilization rate of a central processing unit of the corresponding sub-node, the weight coefficient of the utilization rate of the central processing unit, the utilization rate of a memory, the weight coefficient of the utilization rate of the memory, the utilization rate of a network bandwidth and/or the weight coefficient of the utilization rate of the network bandwidth; according to the number of the data fragments currently distributed by each child node, transmitting the corresponding number of the data fragments to each child node so that each child node processes the data in the data fragments; receiving the processed data fragments from each sub-node; and assembling the data in the plurality of processed data fragments to obtain a second data set.
According to the data processing method provided by this embodiment, after the data in the first data set is sliced to obtain a plurality of data fragments, the number of data fragments currently allocated to each child node is determined according to the load condition when each child node processed the last allocated data fragments; the corresponding number of data fragments is then sent to each child node so that each child node processes the data in them; the processed data fragments are received from each child node; and the data in the plurality of processed data fragments is assembled to obtain a second data set. In this method, the master node dynamically adjusts the number of data fragments allocated to each child node according to the load condition of each child node during data processing, and can distribute the data fragments among the child nodes more evenly, so that the computation load of each child node stays relatively balanced. This avoids the overall efficiency loss caused by overloading some nodes, and improves the efficiency and quality of the system's data processing.
In an alternative embodiment, determining the number of data slices currently allocated to each child node according to the load condition when each child node in the plurality of child nodes processes the last allocated data slice includes: the method comprises the steps of obtaining load information of each child node, wherein the load information comprises at least one of the utilization rate of a central processing unit, the utilization rate of a memory and the utilization rate of a network bandwidth, and the utilization rate of the central processing unit, the utilization rate of the memory and the utilization rate of the network bandwidth are all represented by numerical values from 0 to 1; according to the load information of each sub-node, determining the load index of the corresponding sub-node, wherein the load index is used for representing the load condition of the corresponding sub-node when processing the last distributed data fragment; converting the load index of each sub-node into a first weight coefficient of the corresponding sub-node, wherein the first weight coefficient is the ratio of the load index of the corresponding sub-node to the sum of the load indexes of all the sub-nodes; acquiring the number of preset data fragments; and determining the number of the data fragments currently allocated to each child node according to the first weight coefficient of each child node and the preset number of the data fragments.
According to the data processing method provided by this embodiment, after the load information of each child node is obtained, the load index of the corresponding child node is determined from that load information so as to represent the load condition when the child node processed the last allocated data fragments; the load index of each child node is then converted into a first weight coefficient of the corresponding child node, and the number of data fragments currently allocated to each child node is determined according to the first weight coefficient and the preset number of data fragments. The number of data fragments currently allocated to each child node can thus be determined more reasonably, so that the computation load of each child node is more balanced.
In an alternative embodiment, determining the load index of each sub-node according to the load information of the corresponding sub-node includes: according to the load information of each sub-node, the load index of the corresponding sub-node is determined by the following method:
D=α×A+β×B+γ×C
Wherein D represents a load index of the corresponding child node, α represents a weight coefficient of the utilization rate of the central processor of the corresponding child node, a represents the utilization rate of the central processor of the corresponding child node, β represents a weight coefficient of the utilization rate of the memory of the corresponding child node, B represents the utilization rate of the memory of the corresponding child node, γ represents a weight coefficient of the utilization rate of the network bandwidth of the corresponding child node, and C represents the utilization rate of the network bandwidth of the corresponding child node.
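As an illustration only, the load-index formula above can be computed as follows; the weight coefficients used here (0.5 / 0.3 / 0.2) are hypothetical values chosen to sum to 1, since the patent does not fix them:

```python
def load_index(cpu_util, mem_util, bw_util,
               alpha=0.5, beta=0.3, gamma=0.2):
    """Load index D = alpha*A + beta*B + gamma*C.

    All three utilizations are values in [0, 1]; the weight
    coefficients (hypothetical here) sum to 1 so that D also
    stays in [0, 1].
    """
    return alpha * cpu_util + beta * mem_util + gamma * bw_util

# A node at 80% CPU, 50% memory, 20% bandwidth utilization:
d = load_index(0.8, 0.5, 0.2)  # 0.4 + 0.15 + 0.04 = 0.59
```

With these weights, CPU utilization dominates the index, which matches the intuition that data cleaning is typically CPU-bound.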
In this embodiment, after determining load information such as the utilization rate of the central processing unit, the utilization rate of the memory, and the utilization rate of the network bandwidth, the load index of the corresponding child node is determined through a formula, so that the load condition of the corresponding child node when processing the data in the data fragment can be more accurately represented.
In an alternative embodiment, determining the number of data slices currently allocated to each child node according to the first weight coefficient of each child node and the preset number of data slices includes: and determining the product of the first difference value of each child node and the preset data fragment number as the number of the data fragments currently allocated to the corresponding child node, wherein the first difference value is the difference value between 1 and the first weight coefficient of the corresponding child node.
In this embodiment, the product of the first difference value of each child node and the preset data fragment number is determined as the number of the data fragments currently allocated to the corresponding child node, so that the number of the data fragments currently allocated to each child node can be more reasonably determined, and the calculation load of each child node is more balanced.
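A minimal sketch of this allocation rule, combining the first weight coefficient (the ratio of a node's load index to the sum of all load indices) with the first difference (1 minus that coefficient); rounding the fractional product to an integer is our own assumption, as the embodiment leaves that detail unspecified:

```python
def allocate_fragments(load_indices, preset_count):
    """Turn per-node load indices into per-node fragment counts.

    first weight coefficient: w_i = D_i / sum(D)
    fragments for node i:     n_i = (1 - w_i) * preset_count

    Rounding to an integer is an assumption of this sketch.
    """
    total = sum(load_indices)
    return [round((1 - d / total) * preset_count) for d in load_indices]

# The most loaded node (D = 1.0) receives the fewest fragments:
counts = allocate_fragments([1.0, 0.5, 0.5], preset_count=8)  # [4, 6, 6]
```

Because each first difference is below 1, every node receives fewer than the preset number of fragments, and more heavily loaded nodes receive proportionally fewer.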
In an alternative embodiment, the slicing processing is performed on the first data set to obtain a plurality of data slices, including: dividing the data in the first data set into a plurality of clusters based on a preset clustering mode, wherein the similarity of any two data in each cluster in the plurality of clusters is larger than the preset similarity; and performing slicing processing according to the characteristics of each cluster in the plurality of clusters to obtain a plurality of data slices, wherein the similarity of any two data in each data slice in the plurality of data slices is greater than a preset similarity, and the characteristics of the clusters comprise at least one of quantity, density and outliers.
In this embodiment, during the data slicing process, data with similar characteristics is grouped into the same cluster by the preset clustering mode before slicing, which ensures the consistency of the data distribution within each data fragment.
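As a toy illustration of clustering-driven sharding, the sketch below uses a hand-rolled one-dimensional K-means (K-means is one of the preset clustering modes the description names later); a real system would run a library implementation over full feature vectors:

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Toy 1-D K-means: records with similar scalar features end up
    in the same cluster, and each cluster then becomes the basis of
    one or more data fragments."""
    random.seed(seed)
    centers = random.sample(values, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # keep an empty cluster's old center unchanged
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

fragments = kmeans_1d([1.0, 1.2, 0.9, 10.0, 10.5, 9.8], k=2)
# the three values near 1 and the three near 10 form two fragments
```

Any two values inside one resulting fragment are closer to each other than to values in the other fragment, which is the similarity guarantee the embodiment asks for.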
In an alternative embodiment, before performing the slicing process on the first data set to obtain a plurality of data slices, the method further includes: preprocessing the original data, wherein the preprocessing comprises at least one of denoising, deduplication and normalization; and extracting the characteristics of the preprocessed original data to obtain a first data set, wherein the data in the first data set are the preprocessed original data represented by the vectors with similar characteristics.
In this embodiment, after the original data is acquired, the original data is preprocessed, and feature extraction is performed on the preprocessed original data to obtain the first data set, so that the quality and usability of the data can be ensured.
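A hypothetical sketch of this preprocessing step, over records shaped like `{"id": ..., "value": ...}` (the record schema and the single numeric feature are our assumptions for illustration):

```python
def preprocess(records):
    """Hypothetical cleaning pipeline: drop records with a missing
    value (denoising), remove exact duplicates (deduplication),
    then min-max normalize the value into [0, 1] as a one-element
    feature vector."""
    valid = [r for r in records if r.get("value") is not None]
    seen, unique = set(), []
    for r in valid:
        key = (r["id"], r["value"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    lo = min(r["value"] for r in unique)
    hi = max(r["value"] for r in unique)
    span = (hi - lo) or 1  # avoid division by zero when all equal
    return [[(r["value"] - lo) / span] for r in unique]

first_data_set = preprocess([
    {"id": 1, "value": 10}, {"id": 2, "value": None},
    {"id": 1, "value": 10}, {"id": 3, "value": 30},
])  # -> [[0.0], [1.0]]
```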
In an alternative embodiment, before performing the slicing process on the first data set to obtain a plurality of data slices, the method further includes: and determining a preset clustering mode according to the data characteristics in the first data set, wherein the preset clustering mode comprises at least one of a K-means clustering algorithm, a hierarchical clustering algorithm, a density-based clustering algorithm and a fuzzy clustering algorithm.
In this embodiment, a preset clustering mode is determined according to the data characteristics in the first data set, and then the data in the data set is divided into a plurality of clusters based on the preset clustering mode, so that the data with similar characteristics can be more accurately divided into the same cluster.
In an alternative embodiment, each of the plurality of data slices is provided with an identity, and after receiving the processed data slices from each child node, the method further comprises: determining whether missing data fragments exist or not by comparing the identity of the data fragments with the received identity of the processed data fragments; and in the case of the missing data fragments, sending the missing data fragments to a target child node to process the data in the missing data fragments, wherein the target child node is one of a plurality of child nodes.
In this embodiment, after receiving the data fragments processed by each child node, the master node determines whether any data fragments are missing by comparing the identities of the data fragments with the identities of the received processed data fragments, and, if missing data fragments exist, sends them to the target child node so that the data in the missing data fragments is processed. In this way, data can be tracked and recovered when a fault occurs or data is lost, data integrity is ensured, and the fault tolerance of the system is improved.
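The identity comparison above reduces to a set difference; a minimal sketch (fragment identifiers such as "f1" are hypothetical):

```python
def find_missing_fragments(sent_ids, received_ids):
    """Set difference between the identities of the fragments that
    were dispatched and those that came back processed."""
    return sorted(set(sent_ids) - set(received_ids))

missing = find_missing_fragments({"f1", "f2", "f3", "f4"}, {"f1", "f3"})
# -> ["f2", "f4"]: these are re-dispatched to a target child node
```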
In an alternative embodiment, in the case that the data fragments are sent to each child node for the first time, the number of data fragments currently allocated to each child node is a preset number of data fragments.
In a second aspect, the present invention provides a data processing apparatus, the apparatus comprising: the data slicing module is used for slicing the data in the first data set to obtain a plurality of data slices; the load balancing module is used for determining the number of the data fragments currently distributed by each sub-node according to the load condition when each sub-node processes the last distributed data fragments, wherein the load condition is represented by the utilization rate of a central processing unit of the corresponding sub-node, the weight coefficient of the utilization rate of the central processing unit, the utilization rate of a memory, the weight coefficient of the utilization rate of the memory, the utilization rate of a network bandwidth and/or the weight coefficient of the utilization rate of the network bandwidth; the sending module is used for sending the corresponding number of the data fragments to each child node according to the number of the data fragments currently distributed by each child node so that each child node processes the data in the data fragments; the receiving module is used for receiving the data fragments processed by each sub-node; and the data assembling module is used for assembling the data in the plurality of processed data fragments to obtain a second data set.
In an alternative embodiment, a load balancing module includes: the first acquisition unit is used for acquiring load information of each child node, wherein the load information comprises at least one of the utilization rate of the central processing unit, the utilization rate of the memory and the utilization rate of the network bandwidth, and the utilization rate of the central processing unit, the utilization rate of the memory and the utilization rate of the network bandwidth are all represented by numerical values from 0 to 1; the first determining unit is used for determining the load index of the corresponding sub-node according to the load information of each sub-node, wherein the load index is used for representing the load condition of the corresponding sub-node when processing the last distributed data fragment; the conversion unit is used for converting the load index of each sub-node into a first weight coefficient of the corresponding sub-node, wherein the first weight coefficient is the ratio of the load index of the corresponding sub-node to the sum of the load indexes corresponding to all the sub-nodes; the second acquisition unit is used for acquiring the preset data fragment number; and the second determining unit is used for determining the number of the data fragments currently allocated to each child node according to the first weight coefficient of each child node and the preset number of the data fragments.
In an alternative embodiment, the first determining unit comprises: the first sub-determining unit is used for determining the load index of each corresponding sub-node according to the load information of the corresponding sub-node in the following manner:
D=α×A+β×B+γ×C
Wherein D represents a load index of the corresponding child node, α represents a weight coefficient of the utilization rate of the central processor of the corresponding child node, a represents the utilization rate of the central processor of the corresponding child node, β represents a weight coefficient of the utilization rate of the memory of the corresponding child node, B represents the utilization rate of the memory of the corresponding child node, γ represents a weight coefficient of the utilization rate of the network bandwidth of the corresponding child node, and C represents the utilization rate of the network bandwidth of the corresponding child node.
In an alternative embodiment, the second determining unit comprises: and the second determining subunit is used for determining the product of the first difference value of each child node and the preset data fragment number as the number of the data fragments currently allocated to the corresponding child node, wherein the first difference value is the difference value between 1 and the first weight coefficient of the corresponding child node.
In an alternative embodiment, a data slicing module comprises: the first dividing unit is used for dividing the data in the first data set into a plurality of clustering clusters based on a preset clustering mode, wherein the similarity of any two data in each of the plurality of clustering clusters is larger than the preset similarity; the second dividing unit is used for carrying out slicing processing according to the characteristics of each cluster in the plurality of clusters to obtain a plurality of data slices, wherein the similarity of any two data in each data slice in the plurality of data slices is larger than the preset similarity, and the characteristics of the clusters comprise at least one of quantity, density and outliers.
In an alternative embodiment, the apparatus further comprises: the preprocessing module is used for preprocessing the original data, wherein the preprocessing comprises at least one of denoising, deduplication and normalization; and the feature extraction module is used for carrying out feature extraction on the preprocessed original data to obtain a first data set, wherein the data in the first data set are the preprocessed original data represented by vectors with similar features.
In an alternative embodiment, the load balancing module further comprises: and the third determining unit is used for determining a preset clustering mode according to the data characteristics in the first data set, wherein the preset clustering mode comprises at least one of a K-means clustering algorithm, a hierarchical clustering algorithm, a density-based clustering algorithm and a fuzzy clustering algorithm.
In an alternative embodiment, each of the plurality of data slices is provided with an identity, and after receiving the processed data slices from each child node, the apparatus further comprises: the comparison module is used for determining whether missing data fragments exist or not by comparing the identity of the data fragments with the received identity of the processed data fragments; and the sending module is used for sending the missing data fragments to the target child node under the condition that the missing data fragments exist so as to process the data in the missing data fragments, wherein the target child node is one of a plurality of child nodes.
In an alternative embodiment, in the case that the data fragments are sent to each child node for the first time, the number of data fragments currently allocated to each child node is a preset number of data fragments.
In a third aspect, the present invention provides a computer device comprising: the memory and the processor are in communication connection, the memory stores computer instructions, and the processor executes the computer instructions to perform the method of the first aspect or any implementation manner corresponding to the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect or any of its corresponding embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the related art, the drawings that are required to be used in the description of the embodiments or the related art will be briefly described, and it is apparent that the drawings in the description below are some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a flow chart of a data processing method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a data slicing process according to an embodiment of the present invention;
FIG. 3 is a flow chart of another data processing method according to an embodiment of the present invention;
FIG. 4 is a flow chart of yet another data processing method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data slicing process according to an embodiment of the present invention;
FIG. 6 is a flow chart of yet another data processing method according to an embodiment of the present invention;
FIG. 7 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 8 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 9 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The data processing method provided by the invention can be applied to the field of data cleaning, and can also be applied to data processing tasks in other fields, such as the fields of data analysis, machine learning or image processing. For example, in the field of large-scale data analysis, the data can be scattered to a plurality of computing nodes for parallel processing through the invention, so that the efficiency and the speed of data analysis are improved. In the field of machine learning, the large-scale training data can be processed in a slicing way through the method, so that a machine learning model can be trained on a plurality of computing nodes in parallel, and the model training speed is increased.
The following description will mainly take data cleaning as an example, but the data processing method provided by the invention is not limited thereto.
With the rapid development of the internet and information technology, the scale and variety of data have exponentially increased. From traditional structured data to semi-structured and unstructured data, the form and source of the data has become more diverse. For example, with the advent of emerging technologies such as social media, internet of things, and sensors, various types of data such as text, images, and audio are continually emerging. The data has huge scale and uneven quality, and in the process of data acquisition and storage, errors, missing, repetition or inconsistency and other problems often exist in the data due to human factors, equipment faults, transmission errors and other reasons. Such as data entry errors, non-uniform data formats, missing data fields, etc. These problems affect the accuracy and reliability of subsequent data analysis and mining, and therefore require data cleansing to correct and address these problems.
Data cleaning refers to screening, converting, correcting, and deleting the original data in order to remove errors, inconsistencies, or duplicates, so as to ensure the quality and accuracy of the data. The traditional data cleaning method mainly runs on a single machine: the data is loaded into one computer and the cleaning operations are executed serially. This approach works well with smaller amounts of data, but as the data volume increases, stand-alone processing can no longer meet the demand.
In order to solve the problem of low data cleaning efficiency, distributed data cleaning methods have been developed. Distributed data cleaning distributes data across a plurality of computers for parallel processing: the data is fragmented and distributed to multiple computers, so that the cleaning operation runs simultaneously on multiple computing resources in a cluster, thereby accelerating processing and improving efficiency. Existing distributed computing systems divide the data processing process into two phases: a Map phase (mapping) and a Reduce phase (reduction). In the Map phase, the data is fragmented and distributed to a plurality of computing nodes for parallel processing; in the Reduce phase, the results of all the computing nodes are combined to obtain the final cleaning result. The method therefore has good scalability and fault tolerance and is suitable for cleaning tasks on large-scale data.
However, the existing distributed data cleaning method also has some defects, for example, the Map stage involves slicing and distributing data, which may cause problems of data inclination and unbalanced load, and further cause resource waste and performance degradation.
In view of this, the present invention provides a data processing method, which dynamically adjusts the number of data fragments according to the load condition of each computing node (sub-node) in the data processing process, so as to ensure the load balance of the sub-nodes and improve the efficiency and quality of data processing.
In accordance with an embodiment of the present invention, a data processing method embodiment is provided. It should be noted that the steps illustrated in the flowcharts of the figures may be performed at a master node in a distributed computing system, for example as a set of computer-executable instructions, and, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one shown here.
In this embodiment, a data processing method is provided, which may be used for a master node in a distributed computing system, where the distributed computing system includes a master node and a plurality of child nodes, where the master node and the plurality of child nodes may be devices such as a server, a computer, or a tablet, and the master node is communicatively connected to each of the plurality of child nodes. FIG. 1 is a flow chart of a data processing method according to an embodiment of the invention, as shown in FIG. 1, the method includes the following steps:
Step S101, performing slicing processing on the data in the first data set to obtain a plurality of data slices.
Specifically, data slicing refers to decomposing a large-scale data set into smaller units (sub-data sets) so as to facilitate parallel processing of data by multiple sub-nodes, improving computational efficiency and reducing storage overhead. That is, after the first data set is acquired, the data in the first data set is divided into a plurality of sub-data sets according to a preset rule, so as to obtain a plurality of data fragments. For example, the preset rule may be a similarity between data, for example, data having a similarity higher than the preset similarity is divided into one data fragment.
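A greedy sketch of slicing by the similarity rule just described; the metric similarity(a, b) = 1 / (1 + |a - b|) is a hypothetical choice, since the patent only requires that similarity within a fragment exceed a preset threshold:

```python
def shard_by_similarity(values, sim_threshold=0.9):
    """Greedy sharding: each value joins the first fragment whose
    representative it is similar enough to; otherwise it opens a
    new fragment."""
    fragments = []  # list of (representative, members)
    for v in values:
        for rep, members in fragments:
            if 1 / (1 + abs(v - rep)) > sim_threshold:
                members.append(v)
                break
        else:  # no fragment is similar enough: open a new one
            fragments.append((v, [v]))
    return [members for _, members in fragments]

shards = shard_by_similarity([1.0, 1.05, 5.0, 5.02])
# -> [[1.0, 1.05], [5.0, 5.02]]
```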
The first data set may be a data set obtained after processing original data, where the original data may be derived from a database, an application program, a sensor, or the like.
Step S102, according to the load condition when each of the plurality of sub-nodes processes the last allocated data fragment, the number of the data fragments currently allocated by each sub-node is determined.
When the master node sends data fragments to each of the plurality of child nodes for the first time, it allocates the preset number of data fragments to each child node. The preset number of data fragments is a preconfigured value that can be set by an operator according to the performance of the child nodes.
Specifically, after the master node divides the first data set into a plurality of data fragments, it may first send the preset number of data fragments to each child node. After each child node receives its data fragments, it starts cleaning the data in them. While the data are being cleaned, the master node monitors the load condition of each child node as it processes the allocated data fragments, and then determines the number of data fragments to allocate to each child node next time according to its current load. For example, the number of data fragments allocated next is inversely proportional to the child node's current load: the higher the current load of a child node, the fewer data fragments it receives next.
As shown in fig. 2, suppose the distributed computing system includes 4 child nodes, the preset number of data slices is 10, and the master node divides the first data set into 1000 data slices. When the master node sends data slices to the 4 child nodes for the first time, it allocates 10 data slices to each child node. When it sends data slices for the second time, if the loads of the 4 child nodes rank, from largest to smallest, as child node 4, child node 3, child node 1, child node 2, the master node may allocate 2 data slices to child node 4, 4 data slices to child node 3, 6 data slices to child node 1, and 8 data slices to child node 2.
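The inverse relationship described above can be sketched as follows. This is a minimal illustration assuming a simple proportional rule and made-up load values; it does not reproduce the exact numbers of fig. 2, and the formal allocation formulas are given in the later embodiment.

```python
def allocate_by_load(loads, preset=10):
    """Allocate fewer data slices to more heavily loaded child nodes.

    loads: mapping of child node name -> current load (larger = busier).
    preset: the preset number of data slices.
    """
    total = sum(loads.values())
    # Each node's share of the total load; allocation shrinks as the share grows.
    return {node: round((1 - load / total) * preset)
            for node, load in loads.items()}

# Illustrative loads: node4 is busiest, node2 is least busy.
loads = {"node1": 0.4, "node2": 0.2, "node3": 0.6, "node4": 0.8}
counts = allocate_by_load(loads)
# node4 receives the fewest slices, node2 the most.
```

The busiest node always ends up with the smallest allocation, which matches the ordering in the fig. 2 example even though the concrete counts differ.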
Illustratively, the load condition may be characterized by the utilization rate of the central processor of the corresponding child node, the utilization rate of the memory, and/or the utilization rate of the network bandwidth, each optionally multiplied by a weight coefficient. For example, the load condition of child node 1 may be characterized by the product of its central processor utilization and the weight coefficient of that utilization.
Step S103, according to the number of the data fragments currently distributed by each child node, the corresponding number of the data fragments is sent to each child node, so that each child node processes the data in the data fragments.
Illustratively, as shown in fig. 2, after the master node determines the number of data slices currently allocated to each child node, a corresponding number of data slices is sent to each child node, and accordingly, each child node receives the corresponding number of data slices and processes the data in the data slices, e.g., cleans the data in the data slices.
Step S104, receiving the processed data fragments from each sub-node.
Specifically, as shown in fig. 2, after each child node processes (cleans) the data in the data fragments, the cleaned data fragments are sent to the master node, and accordingly, the master node receives the processed data fragments sent from each child node.
Step S105, assembling the data in the plurality of processed data slices to obtain a second data set.
Specifically, after receiving the processed data fragments sent by each child node, the master node assembles the data in the plurality of processed data fragments to obtain a second data set. Illustratively, during the assembly operation the master node combines the processed data sent by each child node in a manner suited to the characteristics and requirements of the data. For example, structured data may be assembled using a relational database, while unstructured data may be assembled using text processing, image processing, or the like.
According to the data processing method provided by this embodiment, the data in the first data set are sliced into a plurality of data fragments; the number of data fragments currently allocated to each child node is determined according to the load condition of that child node when it processed the previously allocated data fragments; a corresponding number of data fragments is sent to each child node so that each child node processes the data in them; the processed data fragments are received from each child node; and the data in the plurality of processed data fragments are assembled into the second data set. Because the master node dynamically adjusts the number of data fragments allocated to each child node according to its load during data processing, data can be distributed more evenly across the child nodes, the computational load of each child node stays relatively balanced, the overall efficiency loss caused by overloading some nodes is avoided, and both the efficiency and the quality of data processing are improved.
In this embodiment, a data processing method is provided, which may be used for a master node in a distributed computing system, and fig. 3 is a schematic flow chart of another data processing method according to an embodiment of the present invention, as shown in fig. 3, and the method includes the following steps:
in step S301, the data in the first data set is subjected to slicing processing, so as to obtain a plurality of data slices.
Please refer to step S101 in the embodiment shown in fig. 1 in detail, which is not described herein.
Step S302, according to the load condition when each of the plurality of sub-nodes processes the last allocated data fragment, the number of the data fragments currently allocated by each sub-node is determined.
Specifically, the step S302 includes:
In step S3021, load information of each child node is acquired.
The load information comprises at least one of the utilization rate of the central processing unit, the utilization rate of the memory and the utilization rate of the network bandwidth, and the utilization rate of the central processing unit, the utilization rate of the memory and the utilization rate of the network bandwidth are all represented by values from 0 to 1.
Specifically, after load information such as the utilization rate of the central processing unit, the utilization rate of the memory, and the utilization rate of the network bandwidth is determined, the load information is normalized, for example to the range [0, 1]. That is, each of these utilization rates is represented by a value from 0 to 1. For example, the utilization of the CPU may be 0, 0.2, 0.5, or 1; the utilization of the memory may be 0, 0.4, 0.6, or 1; and the utilization of the network bandwidth may be 0, 0.3, 0.8, or 1, where 0 indicates idle and 1 indicates full utilization.
In step S3022, according to the load information of each child node, the load index of the corresponding child node is determined.
The load index is used for representing the load condition of the corresponding child node when the last distributed data fragment is processed.
The invention does not limit the mode of determining the load index of the corresponding sub-node according to the load information of each sub-node.
In some alternative embodiments, the sum of the utilization of the central processor, the utilization of the memory, and the utilization of the network bandwidth may be directly determined as the load index of the corresponding child node.
In other alternative embodiments, according to the load information of each child node, the load index of the corresponding child node may be determined by the following formula (1):
D=α×A+β×B+γ×C (1)
Wherein D represents the load index of the corresponding child node, α represents the weight coefficient of the central processor utilization of the corresponding child node, A represents the central processor utilization of the corresponding child node, β represents the weight coefficient of the memory utilization, B represents the memory utilization, γ represents the weight coefficient of the network bandwidth utilization, and C represents the network bandwidth utilization. The weight coefficients of the CPU utilization, the memory utilization, and the network bandwidth utilization may be preset values set empirically by a worker. For example, the weight coefficient α of the CPU utilization may be 0.5, the weight coefficient β of the memory utilization may be 0.3, and the weight coefficient γ of the network bandwidth utilization may be 0.2.
In this embodiment, after determining the load information such as the utilization rate of the central processing unit, the utilization rate of the memory, and the utilization rate of the network bandwidth, the load index of the corresponding child node is determined according to the above formula (1), so that the load condition of the corresponding child node when processing the data in the data fragment can be more accurately represented.
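Formula (1) can be sketched directly; the default weight coefficients below are the example values 0.5, 0.3, and 0.2 given in the text.

```python
def load_index(cpu, mem, bw, alpha=0.5, beta=0.3, gamma=0.2):
    """Formula (1): D = alpha*A + beta*B + gamma*C.

    All utilization rates (cpu, mem, bw) are assumed normalized to [0, 1].
    """
    return alpha * cpu + beta * mem + gamma * bw

# Example using the sample utilizations from the text:
d = load_index(cpu=0.5, mem=0.4, bw=0.3)
# 0.5*0.5 + 0.3*0.4 + 0.2*0.3 = 0.25 + 0.12 + 0.06 = 0.43
```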
In step S3023, the load index of each child node is converted into a first weight coefficient of the corresponding child node.
The first weight coefficient is the ratio of the load index of the corresponding child node to the sum of the load indexes of all the child nodes. That is, the load index of each child node may be converted into a corresponding first weight coefficient by the following formula (2):

First weight coefficient = load index of the corresponding child node / sum of the load indexes of all child nodes (2)
Specifically, the sum of the first weight coefficients corresponding to all the child nodes is 1.
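The conversion of load indexes into first weight coefficients per formula (2) can be sketched as follows; by construction the coefficients sum to 1.

```python
def first_weights(load_indices):
    """Formula (2): weight_i = D_i / sum of all load indexes D."""
    total = sum(load_indices.values())
    return {node: d / total for node, d in load_indices.items()}

# Illustrative load indexes for three child nodes:
weights = first_weights({"node1": 0.5, "node2": 0.3, "node3": 0.2})
# The first weight coefficients of all child nodes sum to 1.
```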
In step S3024, the preset number of data slices is obtained.
Specifically, the preset number of data fragments is a preset value, and can be set by a worker. The preset data slicing number is the number of data slicing sent to each child node by the master node for the first time.
In step S3025, the number of data slices currently allocated to each child node is determined according to the first weight coefficient of each child node and the preset number of data slices.
In an alternative embodiment, the product of the first difference value of each child node and the preset number of data slices is determined as the number of data slices currently allocated to the corresponding child node.
The first difference value is a difference value between 1 and a first weight coefficient of the corresponding child node.
For example, suppose the distributed computing system includes three child nodes, child node 1, child node 2, and child node 3, and the preset number of data slices is 10. If the first weight coefficient determined by the above steps is 0.5 for child node 1, 0.3 for child node 2, and 0.2 for child node 3, then the first difference is (1-0.5) for child node 1, (1-0.3) for child node 2, and (1-0.2) for child node 3. Accordingly, the number of data slices currently allocated is 5 = (1-0.5)×10 for child node 1, 7 = (1-0.3)×10 for child node 2, and 8 = (1-0.2)×10 for child node 3.
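The first-difference rule can be sketched with the example values from the text; using `round` for non-integer products is an assumption, since the embodiment only gives examples where the product is whole.

```python
def allocate_first_difference(weights, preset=10):
    """Count_i = (1 - weight_i) * preset, the 'first difference' rule."""
    return {node: round((1 - w) * preset) for node, w in weights.items()}

allocate_first_difference({"node1": 0.5, "node2": 0.3, "node3": 0.2})
# {'node1': 5, 'node2': 7, 'node3': 8}
```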
In another alternative embodiment, the difference between the preset number of data slices and the preset gradient corresponding to each child node is determined as the number of data slices currently allocated to the corresponding child node.
The preset gradient corresponding to each child node is the product of the first weight coefficient of each child node and the preset data fragment number.
For example, suppose the distributed computing system includes child node 1 and child node 2, and the preset number of data slices is 20. If the first weight coefficient determined by the above steps is 0.8 for child node 1 and 0.2 for child node 2, the preset gradient is 16 = 0.8×20 for child node 1 and 4 = 0.2×20 for child node 2. Accordingly, the number of data slices currently allocated is 4 = 20-16 for child node 1 and 16 = 20-4 for child node 2.
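The preset-gradient rule can be sketched the same way. Note that it is algebraically identical to the first-difference rule, since preset − weight×preset = (1 − weight)×preset; the two embodiments differ only in how the computation is organized.

```python
def allocate_preset_gradient(weights, preset=20):
    """Count_i = preset - gradient_i, where gradient_i = weight_i * preset."""
    return {node: preset - round(w * preset) for node, w in weights.items()}

allocate_preset_gradient({"node1": 0.8, "node2": 0.2})
# {'node1': 4, 'node2': 16}
```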
In this embodiment, the number of the data fragments currently allocated to the corresponding child node is determined through the preset gradient, so that the number of the data fragments currently allocated to the corresponding child node can be more conveniently and rapidly determined.
Step S303, according to the number of the data fragments currently distributed by each child node, a corresponding number of the data fragments are sent to each child node, so that each child node processes the data in the data fragments.
Please refer to step S103 in the embodiment shown in fig. 1 in detail, which is not described herein.
Step S304, receiving the processed data fragments from each sub-node.
Please refer to step S104 in the embodiment shown in fig. 1 in detail, which is not described herein.
Step S305, assembling the data in the plurality of processed data slices to obtain a second data set.
Please refer to step S105 in the embodiment shown in fig. 1 in detail, which is not described herein.
According to the data processing method provided by this embodiment, after the load information of each child node is obtained, the load index of the corresponding child node is determined from that load information to characterize the load condition when the child node processed the previously allocated data fragments. The load index of each child node is then converted into a first weight coefficient, and the number of data fragments currently allocated to each child node is determined from the first weight coefficient and the preset number of data fragments. This allows the current allocation to be determined more reasonably, so that the computational load of each child node is better balanced.
In this embodiment, a data processing method is provided, which may be used for a master node in a distributed computing system, and fig. 4 is a schematic flow chart of a data processing method according to an embodiment of the present invention, as shown in fig. 4, where the method includes the following steps:
Step S401, preprocessing is performed on the original data.
Wherein the preprocessing includes at least one of denoising, deduplication, and normalization.
Specifically, denoising may refer to removing data noise from the original data, deduplication may refer to removing duplicate data from the original data, and normalization may refer to mapping the data into a specified range to remove the dimensional effects of data with different dimensions. Here, data noise refers to interference data in the data set (e.g., data that describes the scene inaccurately), i.e., random errors or variances in the measured variables; denoising may be performed by filtering, e.g., mean filtering, median filtering, or more complex filters.
Step S402, extracting features of the preprocessed original data to obtain a first data set.
Wherein the data in the first data set is pre-processed raw data represented by vectors having similar characteristics.
Specifically, a suitable feature extraction method, such as statistical features, text features, image features, or time-series features, can be selected according to the characteristics and service requirements of the preprocessed original data, and the original data are converted into vector representations with similar features to obtain the first data set.
Step S403, performing slicing processing on the data in the first data set to obtain a plurality of data slices.
Specifically, the step S403 includes:
Step S4031, dividing the data in the first data set into a plurality of clusters based on the preset clustering mode.
The similarity of any two data in each cluster in the plurality of clusters is larger than the preset similarity. Specifically, the preset similarity is a preset value, and can be determined by a worker.
In some alternative embodiments, prior to step S4031, the data processing method further includes: and determining a preset clustering mode according to the data characteristics in the first data set.
The preset clustering mode comprises at least one of a K-means clustering algorithm, a hierarchical clustering algorithm, a density-based clustering algorithm and a fuzzy clustering algorithm.
For example, the data in the first data set may be continuous or discrete. If the data in the first data set are time-series data, the preset clustering mode may be the K-means clustering algorithm; if the data have a hierarchical structure, the preset clustering mode may be a hierarchical clustering algorithm; if the data have irregular shapes, the preset clustering mode may be a density-based clustering algorithm; and if the data have fuzzy properties, the preset clustering mode may be a fuzzy clustering algorithm. In addition, a suitable clustering algorithm may be selected according to factors such as the size, dimensionality, or noise of the data in the first data set.
The following describes the above step S4031 by taking a preset clustering method as an example of a K-means clustering algorithm.
Specifically, as shown in fig. 5, after the first data set is determined, data points are primarily divided into K clusters based on a K-means clustering algorithm, then distances between each data point in the first data set and centers of the K clusters are calculated, and the data points are moved to the cluster where the closest cluster center is located.
Illustratively, the distance of a data point to a cluster center can be determined by the following equation (3):

ρ = √((x1 - x2)² + (y1 - y2)²) (3)

where ρ represents the distance of the data point to the cluster center, (x1, y1) represents the coordinates of the cluster center, and (x2, y2) represents the coordinates of the data point. The smaller the distance, the higher the similarity between the data point and the cluster center.
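Equation (3) and the assignment of a data point to its nearest cluster center can be sketched as follows (two-dimensional points, as in the equation):

```python
import math

def distance(point, center):
    """Equation (3): Euclidean distance between a data point and a cluster center."""
    (x1, y1), (x2, y2) = center, point
    return math.hypot(x1 - x2, y1 - y2)

def nearest_cluster(point, centers):
    """Index of the closest cluster center, i.e. the most similar cluster."""
    return min(range(len(centers)), key=lambda i: distance(point, centers[i]))

distance((3, 4), (0, 0))                   # 5.0
nearest_cluster((1, 1), [(0, 0), (5, 5)])  # 0
```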
And step S4032, performing slicing processing according to the characteristics of each cluster in the plurality of clusters to obtain a plurality of data slices.
The similarity of any two data in each data fragment in the plurality of data fragments is larger than the preset similarity, and the characteristics of the cluster comprise at least one of quantity, density and outliers.
Specifically, the number represents the number of data in the cluster, the density represents the degree of tightness of data distribution in the cluster, and the cluster with higher density represents the closer distance between data points, and the similarity between the data points and the cluster center of the cluster is higher. Outliers refer to data points that have significantly different characteristics compared to other data points.
For example, when the number of clusters is greater than the first preset number, or when the density of clusters is greater than the preset density, that is, when the data amount of a certain cluster is greater or the density is higher, the data in the cluster may be divided into a plurality of data fragments, so as to balance the query load and improve the parallel query efficiency. When the number of the clusters is smaller than or equal to the first preset number, or when the density of the clusters is smaller than or equal to the preset density, that is, when the data volume of a certain cluster is smaller or the density is lower, one cluster can be divided into one data fragment, so that the data transmission cost is reduced and the query efficiency is improved. When the cluster contains the outliers which are larger than or equal to the second preset number, namely when the cluster contains a large number of outliers, the data in the cluster can be distributed into fewer data fragments, so that the query load on the outliers is reduced, the query efficiency is improved, and when the cluster contains the outliers which are smaller than the second preset number, the data in the cluster can be divided into a plurality of data fragments, the parallel query efficiency is improved, and the data transmission cost is reduced.
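The cluster-feature rules above can be sketched as follows. The thresholds, the target shard size, and the precedence between the rules are illustrative assumptions, since the embodiment specifies only the direction of each rule, not concrete values.

```python
import math

def shard_count(size, density, n_outliers,
                size_threshold=100,      # assumed "first preset number"
                density_threshold=0.5,   # assumed "preset density"
                outlier_threshold=5,     # assumed "second preset number"
                target_shard_size=50):   # assumed size of one shard
    """Decide how many data shards a cluster should be split into."""
    if n_outliers >= outlier_threshold:
        # Many outliers: keep the cluster in few shards to limit outlier query load.
        return 1
    if size > size_threshold or density > density_threshold:
        # Large or dense cluster: split it to balance the parallel query load.
        return max(2, math.ceil(size / target_shard_size))
    # Small, sparse cluster: one shard, reducing data transmission cost.
    return 1
```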
Specifically, in conventional data slicing, differences or inconsistencies may arise between the data in different slices because of the characteristics of the data themselves or the choice of slicing strategy. For example, suppose there is a data set containing user purchase records that needs to be divided into data slices according to the users' purchase behavior. If the slicing policy slices by user identity (ID), the purchase records may be distributed unevenly: because purchase behavior differs greatly between users, some users may buy a large number of commodities while others buy few, so some data slices hold much more data than others, and the data volumes are inconsistent. If the slicing policy slices by time, changes in purchase behavior become the problem: a user's purchases vary over time, e.g., some users buy many commodities in one period and few in others, so the purchase records of the same user are scattered across different data slices, again producing inconsistent data. Such inconsistencies in the data slices can affect subsequent data processing and analysis.
In this embodiment, the data are sliced using the preset clustering mode, and data with similar characteristics are allocated to the same cluster, which ensures the distribution consistency of the data after slicing.
Step S404, according to the load condition when each of the plurality of sub-nodes processes the last allocated data fragment, the number of the data fragments currently allocated by each sub-node is determined.
Please refer to step S102 in the embodiment shown in fig. 1 or step S302 in the embodiment shown in fig. 3 in detail, which will not be described herein.
Step S405, according to the number of the data fragments currently allocated to each child node, sending a corresponding number of the data fragments to each child node, so that each child node processes the data in the data fragments.
Please refer to step S103 in the embodiment shown in fig. 1 in detail, which is not described herein.
In step S406, the processed data fragments from each child node are received.
Please refer to step S104 in the embodiment shown in fig. 1 in detail, which is not described herein.
Step S407, determining whether there is a missing data fragment by comparing the identity of the data fragment with the received identity of the processed data fragment.
Specifically, after the data in the first data set are sliced into a plurality of data fragments, an identification (ID) is configured for each of the plurality of data fragments. When the master node sends data fragments to a child node, each sent data fragment carries its identification; after a child node receives its allocated data fragments and processes the data in them, the identifications of the fragments remain unchanged.
After receiving the processed data fragments sent by each child node, the master node can determine whether any data fragment is missing by comparing the identifications of the received processed data fragments against the stored ID list of the plurality of data fragments. If the IDs of the processed data fragments do not correspond one-to-one with the stored ID list, a data fragment is missing; if they do correspond one-to-one, no data fragment is missing. If a data fragment is missing, step S408 is performed; otherwise, step S409 is performed.
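The ID comparison in step S407 can be sketched as a set difference between the dispatched fragment IDs and the returned fragment IDs:

```python
def missing_fragments(sent_ids, received_ids):
    """IDs of data fragments that were dispatched but never returned."""
    return sorted(set(sent_ids) - set(received_ids))

missing_fragments([1, 2, 3, 4], [2, 4, 1])  # [3] -> fragment 3 must be re-sent
```

When the result is empty, every dispatched fragment came back and assembly (step S409) can proceed; otherwise the listed fragments are re-sent to a target child node (step S408).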
In step S408, in the case where there is a missing data slice, the missing data slice is sent to the target child node to process the data in the missing data slice.
Wherein the target child node is one of a plurality of child nodes.
Specifically, the missing data fragments can be determined from the stored ID list of the plurality of data fragments. Once determined, the missing data fragments are sent to the target child node so that the target child node processes the data in them. After the target child node finishes processing, it sends the processed missing data fragments back to the master node; the master node receives them and verifies them again, ensuring the correct association between the cleaned data fragments and the original data fragments.
Step S409, assembling the data in the plurality of processed data slices to obtain a second data set.
Please refer to step S105 in the embodiment shown in fig. 1 in detail, which is not described herein.
According to the data processing method provided by this embodiment, after the original data are acquired they are preprocessed, and features are extracted from the preprocessed original data to obtain the first data set, which guarantees the quality and usability of the data. After the first data set is determined, its data are divided into a plurality of clusters based on the preset clustering mode, and slicing is then performed according to the characteristics of each cluster to obtain a plurality of data fragments. The data within the same data fragment therefore have high similarity and the data fragments have high consistency, so subsequent processing proceeds more smoothly and data-inconsistency problems are reduced. After the data fragments are obtained, the number allocated to each child node is dynamically adjusted according to the child node's load, which keeps the allocation even and improves the efficiency and quality of data processing. Finally, after receiving the processed data fragments from each child node, the master node determines whether any data fragment is missing by comparing the identifications of the data fragments with those of the received processed data fragments, and, if so, sends the missing data fragments to a target child node for processing. Data can therefore be tracked and recovered when faults or data losses occur, ensuring data integrity and improving the fault tolerance of the system.
The flow of data processing provided by the present invention will be described in detail below with reference to fig. 6 by taking data cleansing as an example.
Illustratively, the original data are first acquired and preprocessed, and features are then extracted from the preprocessed original data to obtain a first data set; that is, steps S601, S602, and S603 are performed. After the first data set is obtained, it may be sliced using the data slicing manner of any of the above embodiments to obtain a plurality of data slices, and then, based on the determined number of data slices currently allocated to each child node, a corresponding number of data slices are sent to child node 1, child node 2, and child node n in fig. 6; that is, steps S604 and S605 are performed. After the processed data slices are received from child node 1, child node 2, and child node n, whether any data slice is missing is determined by comparing the identifications of the data slices with those of the received processed data slices, i.e., data verification is performed, and it is determined whether data cleansing needs to be redone; that is, steps S606 and S607 are performed. When data cleansing is required, step S605 is performed again; when it is not required, step S608 is performed to assemble the data and obtain the second data set.
In this embodiment, a data processing device is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and will not be described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a data processing apparatus, as shown in fig. 7, including:
a data slicing module 701, configured to perform slicing processing on data in the first data set to obtain a plurality of data slices;
The load balancing module 702 is configured to determine, according to a load condition when each of the plurality of child nodes processes a last allocated data fragment, a number of data fragments currently allocated to each child node, where the load condition is represented by a utilization rate of a central processor of the corresponding child node, a weight coefficient of the utilization rate of the central processor, a utilization rate of a memory, a weight coefficient of the utilization rate of the memory, a utilization rate of a network bandwidth, and/or a weight coefficient of the utilization rate of the network bandwidth;
A sending module 703, configured to send a corresponding number of data fragments to each child node according to the number of data fragments currently allocated to each child node, so that each child node processes data in the data fragments;
a receiving module 704, configured to receive the processed data fragments from each child node;
the data assembling module 705 is configured to assemble data in the plurality of processed data fragments to obtain a second data set.
In some alternative embodiments, the load balancing module 702 includes:
a first acquisition unit, configured to acquire load information of each child node, where the load information includes at least one of the utilization rate of the central processing unit, the utilization rate of the memory, and the utilization rate of the network bandwidth, each expressed as a value between 0 and 1;
a first determining unit, configured to determine the load index of each child node according to its load information, where the load index characterizes the load condition of the corresponding child node when it processed the last allocated data fragments;
a conversion unit, configured to convert the load index of each child node into a first weight coefficient of the corresponding child node, where the first weight coefficient is the ratio of the load index of the corresponding child node to the sum of the load indexes of all the child nodes;
a second acquisition unit, configured to acquire a preset number of data fragments;
a second determining unit, configured to determine the number of data fragments currently allocated to each child node according to the first weight coefficient of each child node and the preset number of data fragments.
In some alternative embodiments, the first determining unit includes:
a first determining subunit, configured to determine the load index of each child node from its load information as follows:
D=α×A+β×B+γ×C
wherein D represents the load index of the corresponding child node, α represents the weight coefficient of the utilization rate of the central processing unit of the corresponding child node, A represents the utilization rate of the central processing unit of the corresponding child node, β represents the weight coefficient of the utilization rate of the memory of the corresponding child node, B represents the utilization rate of the memory of the corresponding child node, γ represents the weight coefficient of the utilization rate of the network bandwidth of the corresponding child node, and C represents the utilization rate of the network bandwidth of the corresponding child node.
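As a minimal illustration of the load-index formula above, the following sketch computes D for one child node; the weight values and the sample utilization figures are assumptions, not values prescribed by the embodiment:

```python
# Illustrative sketch of the load index D = alpha*A + beta*B + gamma*C
# described above. The weight coefficients and sample metrics are
# assumptions, not taken from the embodiment.

def load_index(cpu, mem, bw, alpha=0.5, beta=0.3, gamma=0.2):
    """Each utilization is a value in [0, 1]; a higher D means a heavier load."""
    return alpha * cpu + beta * mem + gamma * bw

# Example: a child node at 80% CPU, 50% memory, 20% bandwidth utilization.
d = load_index(0.8, 0.5, 0.2)
print(round(d, 2))  # 0.59
```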
In some alternative embodiments, the second determining unit includes:
a second determining subunit, configured to determine, as the number of data fragments currently allocated to each child node, the product of the first difference of that child node and the preset number of data fragments, where the first difference is the difference between 1 and the first weight coefficient of the corresponding child node.
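Putting the first weight coefficient and the first difference together, the allocation rule described above can be sketched as follows; the node names, load indexes, and preset fragment count are illustrative assumptions:

```python
# Sketch of the allocation rule described above: each child node's first
# weight coefficient is its load index divided by the sum of all load
# indexes, and its current allocation is (1 - weight) * preset count.
# Node names and values are illustrative assumptions.

def allocate_fragments(load_indexes, preset_count):
    total = sum(load_indexes.values())
    return {
        node: round((1 - d / total) * preset_count)
        for node, d in load_indexes.items()
    }

counts = allocate_fragments({"node1": 0.6, "node2": 0.3, "node3": 0.3}, 12)
print(counts)  # {'node1': 6, 'node2': 9, 'node3': 9}
```

Note that the more heavily loaded node1 receives fewer fragments, which is the load-balancing effect the embodiment describes.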
In some alternative embodiments, data slicing module 701 comprises:
a first dividing unit, configured to divide the data in the first data set into a plurality of clusters based on a preset clustering mode, where the similarity of any two data items in each of the plurality of clusters is greater than a preset similarity;
a second dividing unit, configured to perform slicing according to the characteristics of each of the plurality of clusters to obtain the plurality of data fragments, where the similarity of any two data items in each of the plurality of data fragments is greater than the preset similarity, and the characteristics of a cluster include at least one of quantity, density, and outliers.
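As a toy illustration of the cluster-then-shard idea above, the sketch below groups one-dimensional values so that items within each fragment are close to each other; the distance threshold stands in for the preset similarity, and all values are illustrative assumptions:

```python
# Toy sketch of the cluster-then-shard step described above: data points
# are grouped so that items within a cluster are similar (here, close in
# value), and each cluster then becomes one data fragment. The distance
# threshold and the sample data are illustrative assumptions.

def cluster_shard(values, threshold):
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= threshold:
            clusters[-1].append(v)  # similar enough: join the current cluster
        else:
            clusters.append([v])    # start a new cluster, i.e. a new fragment
    return clusters

fragments = cluster_shard([1.0, 1.2, 9.8, 10.1, 1.1, 10.0], threshold=0.5)
print(fragments)  # [[1.0, 1.1, 1.2], [9.8, 10.0, 10.1]]
```

A production system would replace this threshold rule with one of the clustering modes named later in the embodiment (K-means, hierarchical, density-based, or fuzzy clustering).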
In some alternative embodiments, the apparatus further comprises:
a preprocessing module, configured to preprocess the original data, where the preprocessing includes at least one of denoising, deduplication, and normalization;
a feature extraction module, configured to perform feature extraction on the preprocessed original data to obtain the first data set, where the data in the first data set are the preprocessed original data represented by vectors with similar features.
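A minimal sketch of the preprocessing step above, covering only deduplication and min-max normalization; denoising and the vector feature extraction are omitted, and the sample values are assumptions:

```python
# Minimal sketch of the preprocessing described above: deduplication
# followed by min-max normalization into [0, 1]. Denoising and feature
# extraction are omitted; the sample values are illustrative.

def preprocess(raw):
    deduped = sorted(set(raw))               # deduplication
    lo, hi = min(deduped), max(deduped)
    span = hi - lo or 1.0                    # guard against a constant data set
    return [(x - lo) / span for x in deduped]  # normalization

print(preprocess([4.0, 2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```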
In some alternative embodiments, the load balancing module 702 further comprises:
And the third determining unit is used for determining a preset clustering mode according to the data characteristics in the first data set, wherein the preset clustering mode comprises at least one of a K-means clustering algorithm, a hierarchical clustering algorithm, a density-based clustering algorithm and a fuzzy clustering algorithm.
In some optional embodiments, each of the plurality of data fragments is provided with an identifier, and after receiving the processed data fragments from each child node, the apparatus further includes:
a comparison module, configured to determine whether any data fragments are missing by comparing the identifiers of the sent data fragments with the identifiers of the received processed data fragments;
a sending module, configured to send the missing data fragments to a target child node in the case where missing data fragments exist, so that the data in the missing data fragments is processed, where the target child node is one of the plurality of child nodes.
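The integrity check described above can be sketched as a set difference over fragment identifiers; the identifier values are illustrative assumptions:

```python
# Sketch of the integrity check described above: compare the identifiers of
# the fragments that were sent with those received back, so that any missing
# fragments can be resent to a target child node. The IDs are illustrative.

def find_missing(sent_ids, received_ids):
    return sorted(set(sent_ids) - set(received_ids))

missing = find_missing({"f1", "f2", "f3", "f4"}, {"f1", "f3"})
print(missing)  # ['f2', 'f4'] would be resent to the target child node
```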
In some optional embodiments, when data fragments are sent to each child node for the first time, the number of data fragments currently allocated to each child node is the preset number of data fragments.
Further functional descriptions of the above modules and units are the same as those in the corresponding method embodiments above and are not repeated here.
The data processing apparatus in this embodiment is presented in the form of functional units, where a unit refers to an application-specific integrated circuit (ASIC), a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above-described functions.
An embodiment of the present invention further provides a data processing system, as shown in fig. 8, where the data processing system includes: a data slicing module 701, a data cleansing module 801, a load balancing module 702, a data assembling module 705, and a data cleansing module 802. Specifically, the data cleansing module 801 is configured to send the corresponding number of data fragments to each child node so that the child node performs a cleansing operation on the data in its data fragments, and the data cleansing module 802 is configured to send any missing data fragments to the target child node so that the target child node cleanses the missing data fragments, thereby ensuring the integrity of the data.
The embodiment of the invention also provides computer equipment, which is provided with the data processing device shown in the figure 7.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 9, the computer device includes: one or more processors 910, a memory 920, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, with each device providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 910 is illustrated in fig. 9.
The processor 910 may be a central processing unit, a network processor, or a combination thereof. The processor 910 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 920 stores instructions executable by the at least one processor 910 to cause the at least one processor 910 to perform a method for implementing the embodiments described above.
Memory 920 may include a storage program area that may store an operating system, at least one application required for functionality, and a storage data area; the storage data area may store data created according to the use of the computer device, etc. In addition, memory 920 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 920 may optionally include memory located remotely from processor 910, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 920 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; memory 920 may also include combinations of the above types of memory.
The computer device also includes a communication interface 930 for the computer device to communicate with other devices or communication networks.
The embodiments of the present invention also provide a computer-readable storage medium. The method according to the embodiments described above may be implemented in hardware or firmware, or as computer code recorded on a storage medium, or as computer code originally stored on a remote storage medium or a non-transitory machine-readable storage medium, downloaded over a network, and stored in a local storage medium, so that the method described herein may be executed from such code on a storage medium by a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk, or the like; further, the storage medium may also comprise a combination of memories of the kinds described above. It will be appreciated that a computer, processor, microcontroller, or programmable hardware includes a storage element that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (12)

1. A data processing method for a master node in a distributed computing system, the distributed computing system including the master node and a plurality of child nodes, the method comprising:
slicing the data in the first data set to obtain a plurality of data fragments;
determining the number of data fragments currently allocated to each of the plurality of child nodes according to the load condition of each child node when it processed the last allocated data fragments, wherein the load condition is characterized by the utilization rate of a central processing unit of the corresponding child node, a weight coefficient of the utilization rate of the central processing unit, the utilization rate of a memory, a weight coefficient of the utilization rate of the memory, the utilization rate of a network bandwidth, and/or a weight coefficient of the utilization rate of the network bandwidth;
sending the corresponding number of data fragments to each child node according to the number of data fragments currently allocated to that child node, so that each child node processes the data in its data fragments;
receiving the processed data fragments from each child node; and
assembling the data in the plurality of processed data fragments to obtain a second data set.
2. The method of claim 1, wherein the determining the number of data fragments currently allocated to each of the plurality of child nodes according to the load condition when each child node processed the last allocated data fragments comprises:
acquiring load information of each child node, wherein the load information comprises at least one of the utilization rate of a central processing unit, the utilization rate of a memory, and the utilization rate of a network bandwidth, each expressed as a value between 0 and 1;
determining a load index of each child node according to the load information of that child node, wherein the load index characterizes the load condition of the corresponding child node when it processed the last allocated data fragments;
converting the load index of each child node into a first weight coefficient of the corresponding child node, wherein the first weight coefficient is the ratio of the load index of the corresponding child node to the sum of the load indexes of all the child nodes;
acquiring a preset number of data fragments; and
determining the number of data fragments currently allocated to each child node according to the first weight coefficient of each child node and the preset number of data fragments.
3. The method according to claim 2, wherein the determining the load index of the corresponding child node according to the load information of each child node comprises:
according to the load information of each child node, the load index of the corresponding child node is determined as follows:
D=α×A+β×B+γ×C
wherein D represents the load index of the corresponding child node, α represents the weight coefficient of the utilization rate of the central processing unit of the corresponding child node, A represents the utilization rate of the central processing unit of the corresponding child node, β represents the weight coefficient of the utilization rate of the memory of the corresponding child node, B represents the utilization rate of the memory of the corresponding child node, γ represents the weight coefficient of the utilization rate of the network bandwidth of the corresponding child node, and C represents the utilization rate of the network bandwidth of the corresponding child node.
4. The method according to claim 2, wherein the determining the number of data fragments currently allocated to each child node according to the first weight coefficient of each child node and the preset number of data fragments comprises:
determining, as the number of data fragments currently allocated to each child node, the product of the first difference of that child node and the preset number of data fragments, wherein the first difference is the difference between 1 and the first weight coefficient of the corresponding child node.
5. The method according to any one of claims 1 to 4, wherein the slicing the data in the first data set to obtain a plurality of data fragments comprises:
dividing the data in the first data set into a plurality of clusters based on a preset clustering mode, wherein the similarity of any two data items in each of the plurality of clusters is greater than a preset similarity; and
performing slicing according to the characteristics of each of the plurality of clusters to obtain the plurality of data fragments, wherein the similarity of any two data items in each of the plurality of data fragments is greater than the preset similarity, and the characteristics of a cluster comprise at least one of quantity, density, and outliers.
6. The method of claim 5, wherein prior to the fragmenting the first data set to obtain the plurality of data fragments, the method further comprises:
Preprocessing the original data, wherein the preprocessing comprises at least one of denoising, deduplication and normalization;
And extracting features of the preprocessed original data to obtain the first data set, wherein the data in the first data set are the preprocessed original data represented by vectors with similar features.
7. The method of claim 5, wherein prior to the fragmenting the first data set to obtain the plurality of data fragments, the method further comprises:
And determining the preset clustering mode according to the data characteristics in the first data set, wherein the preset clustering mode comprises at least one of a K-means clustering algorithm, a hierarchical clustering algorithm, a density-based clustering algorithm and a fuzzy clustering algorithm.
8. The method according to any one of claims 1 to 4, wherein each of the plurality of data fragments is provided with an identifier, and after the receiving the processed data fragments from each child node, the method further comprises:
determining whether any data fragments are missing by comparing the identifiers of the sent data fragments with the identifiers of the received processed data fragments; and
in the case where missing data fragments exist, sending the missing data fragments to a target child node so that the data in the missing data fragments is processed, wherein the target child node is one of the plurality of child nodes.
9. The method according to any one of claims 1 to 4, wherein, when the data fragments are sent to each child node for the first time, the number of data fragments currently allocated to each child node is the preset number of data fragments.
10. A data processing apparatus, the apparatus comprising:
a data slicing module, configured to slice the data in the first data set to obtain a plurality of data fragments;
a load balancing module, configured to determine the number of data fragments currently allocated to each child node according to the load condition of each child node when it processed the last allocated data fragments, wherein the load condition is characterized by the utilization rate of a central processing unit of the corresponding child node, a weight coefficient of the utilization rate of the central processing unit, the utilization rate of a memory, a weight coefficient of the utilization rate of the memory, the utilization rate of a network bandwidth, and/or a weight coefficient of the utilization rate of the network bandwidth;
a sending module, configured to send the corresponding number of data fragments to each child node according to the number of data fragments currently allocated to that child node, so that each child node processes the data in its data fragments;
a receiving module, configured to receive the processed data fragments from each child node; and
a data assembling module, configured to assemble the data in the plurality of processed data fragments to obtain a second data set.
11. A computer device, comprising:
A memory and a processor in communication with each other, the memory having stored therein computer instructions which, upon execution, cause the processor to perform the method of any of claims 1 to 9.
12. A computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 9.
CN202410234031.7A 2024-02-29 2024-02-29 Data processing method, device, computer equipment and storage medium Pending CN118069359A (en)


Publications (1)

Publication Number Publication Date
CN118069359A true CN118069359A (en) 2024-05-24

Family

ID=91101442



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination