CN107229693B - The method and system of big data system configuration parameter tuning based on deep learning - Google Patents

The method and system of big data system configuration parameter tuning based on deep learning

Info

Publication number
CN107229693B
CN107229693B (application CN201710361578.3A)
Authority
CN
China
Prior art keywords
parameter
neural network
layer
output
mapping
Prior art date
Legal status
Active
Application number
CN201710361578.3A
Other languages
Chinese (zh)
Other versions
CN107229693A (en)
Inventor
王宏志
王艺蒙
赵志强
孙旭冉
Current Assignee
Da Da Data Industry Co Ltd
Original Assignee
Da Da Data Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Da Da Data Industry Co Ltd filed Critical Da Da Data Industry Co Ltd
Priority to CN201710361578.3A priority Critical patent/CN107229693B/en
Publication of CN107229693A publication Critical patent/CN107229693A/en
Application granted granted Critical
Publication of CN107229693B publication Critical patent/CN107229693B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a method and a system for tuning the configuration parameters of a big data system based on deep learning. The method includes a neural network training step: a deep neural network is preliminarily constructed, with at least one MapReduce parameter as an input parameter, the optimal configuration parameters to be predicted as output parameters, and historical data of the big data system as the training sample set; then, with the MapReduce time as the metric of the deep neural network, the weights of the neurons in each layer are adjusted by a parameter learning rule based on back propagation until the MapReduce time meets the time-cost requirement. The method further includes a configuration parameter prediction step: initial values of the at least one MapReduce parameter are set, the current test data is read, and both are input into the deep neural network obtained in the neural network training step to obtain the configuration parameters. The invention tunes the configuration parameters of the MapReduce framework with a deep neural network, avoiding manual adjustment, and the predicted parameters perform well in application.

Description

Method and system for optimizing configuration parameters of big data system based on deep learning
Technical Field
The invention relates to the technical field of computers, in particular to a method and a system for optimizing configuration parameters of a big data system based on deep learning.
Background
In recent years, big data exploration and analysis have developed rapidly in many fields. A big data system can be divided into three levels: (1) the infrastructure layer, the basic data-processing layer, which allocates hardware resources to the platform layer that executes computation tasks; (2) the platform layer, the core service layer, which offers the application layer an interface for conveniently processing data sets and manages the resources allocated by the infrastructure layer; (3) the application layer, the result-output layer, which predicts expert decisions and delivers the big data analysis results.
The platform layer links the layers above and below it and is the core of a big data system. MapReduce in the Hadoop system is a model at the platform layer. Hadoop is a distributed system infrastructure: a user can develop distributed programs without knowing the underlying distributed details, exploiting the full power of a cluster for high-speed computation and storage. MapReduce is the programming model under Hadoop for parallel processing of large data sets (larger than 1 TB); it greatly simplifies running programs on a distributed system, sparing programmers distributed parallel programming. Hadoop's MapReduce implementation splits a single job, dispatches map tasks to multiple nodes, and then reduces the partial results into a single data set loaded into the data warehouse.
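As a concrete illustration of the map/reduce split described above, the canonical word-count job can be sketched in a few lines of Python. This example is illustrative only and is not part of the patent; the function names are ours.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Map: emit one (key, value) pair per word; in Hadoop this work
    # is dispatched across many nodes.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, as the framework does
    # between the map and reduce phases.
    pairs = sorted(pairs, key=itemgetter(0))
    return {k: [v for _, v in grp] for k, grp in groupby(pairs, key=itemgetter(0))}

def reduce_phase(grouped):
    # Reduce: fold each key's values into a single result data set.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle(map_phase("big data big systems")))
# counts == {"big": 2, "data": 1, "systems": 1}
```

The same split (stateless map, keyed reduce) is what makes the job parallelizable across nodes.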
The configuration parameters strongly influence MapReduce performance. Well-chosen parameters make MapReduce run well, while misconfigured parameters are a main cause of performance degradation and failures in Hadoop's MapReduce system. To help platform administrators optimize system performance, the configuration parameters must be adjusted for different system features, different programs, and different input data in pursuit of faster performance. In the traditional approach, an administrator tunes the parameters one by one, or uses linear regression: parameter features are extracted and related to observed MapReduce performance so as to give an approximately optimal solution, and the configuration parameters are predicted to achieve better working performance.
However, administrators of a Hadoop system face two major challenges: (1) the behavior and characteristics of a large-scale distributed system are so complex that suitable configuration parameters are hard to find; (2) the system has hundreds of parameters, dozens of which dominate system performance, which makes tuning cumbersome. With the traditional approach, manual tuning or automatic tuning by regression is tedious, consumes a great deal of time, yields mediocre results, and leaves the system as a whole running slowly for a long time.
Disclosure of Invention
Aiming at the low efficiency and poor results of manual tuning and regression-based automatic tuning in the prior art, the technical problem solved by the invention is to provide a method and a system for tuning the configuration parameters of a big data system based on deep learning.
The invention provides a method for tuning the configuration parameters of a big data system based on deep learning, comprising a neural network training step and a configuration parameter prediction step; wherein,
the neural network training step comprises the following steps:
step 1-1, preliminarily constructing a deep neural network, with at least one MapReduce parameter as an input parameter, the optimal configuration parameters to be predicted as output parameters, and historical data of the big data system as the training sample set;
step 1-2, with the MapReduce time as the metric of the deep neural network, adjusting the weights of the neurons in each layer according to a parameter learning rule based on back propagation until the MapReduce time meets the time-cost requirement;
the configuration parameter prediction step comprises the following steps:
step 2-1, setting initial values of the at least one MapReduce parameter and reading the current test data;
step 2-2, inputting the initial values of the at least one MapReduce parameter and the current test data into the deep neural network obtained in the neural network training step, to obtain the configuration parameters of the deep-learning-based big data system.
In the method for tuning the configuration parameters of a big data system based on deep learning, the number of the at least one MapReduce parameter is between 2 and 20.
The invention also provides a system for tuning the configuration parameters of a big data system based on deep learning, comprising a neural network training module and a configuration parameter prediction module; wherein,
the neural network training module preliminarily constructs a deep neural network, with at least one MapReduce parameter as an input parameter, the optimal configuration parameters to be predicted as output parameters, and historical data of the big data system as the training sample set; with the MapReduce time as the metric of the deep neural network, it adjusts the weights of the neurons in each layer according to a parameter learning rule based on back propagation until the MapReduce time meets the time-cost requirement;
the configuration parameter prediction module inputs the set initial values of the at least one MapReduce parameter and the current test data into the deep neural network obtained in the neural network training step, to obtain the configuration parameters of the deep-learning-based big data system.
In the deep-learning-based big data system configuration parameter tuning system, the number of the at least one MapReduce parameter is between 2 and 20.
Implementing the method and system for tuning the configuration parameters of a big data system based on deep learning has the following beneficial effects: tuning the configuration parameters of the MapReduce framework with a deep neural network avoids the difficulties of manual adjustment and of searching for optimal parameters; by learning from historical parameters, the characteristics of each configuration parameter and the relationships among them can be captured more deeply, and the deep neural network's repeated learning, weight updating, and prediction yield the parameter configuration best suited to the application layer's requirements. The invention not only saves parameter-tuning time but also ensures that, with suitable parameters, the system's working time is allotted to compressing and decompressing data, which greatly reduces writing and transmission time, so that the whole system runs fast and achieves good working results.
Drawings
FIG. 1 is a flow chart of a method for configuration parameter tuning for big data systems based on deep learning according to a preferred embodiment of the present invention;
FIG. 2 is a schematic flow chart of the neural network training step in the method according to the preferred embodiment of the present invention;
FIG. 3 is a block diagram of a system for deep learning based big data system configuration parameter tuning, according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention provides a method for tuning the configuration parameters of a big data system with a deep neural network; introducing a deep neural network framework into the parameter configuration step saves time and cost and achieves good working results. The method focuses on learning and tuning the parameters of the map tasks and reduce tasks of a big data system. MapReduce is a complex process, so its workflow is briefly introduced below. The main steps are as follows:
map end (mapping end) working process
(1) Each input split, whose size equals one block of the distributed file system (HDFS) (64 MB by default), is processed by one map task. The map output is first placed in a circular memory buffer; the buffer's default size is 100 MB and is controlled by an io.-prefixed property. When the buffer is about to overflow (by default at 80% of the buffer size, controlled by the io.sort.spill.percent property), a spill file is created in the local file system and the buffered data is written to it.
(2) Before writing to disk, a thread first divides the data into as many partitions as there are reduce tasks, so that each reduce task corresponds to the data of one partition. The data in each partition is then sorted and, if a Combiner is set, a combine (merge) operation is applied to the sorted result.
(3) When the map task outputs its last record, there may be many spill files, which need to be merged. Sorting and combine operations are performed continually during merging.
(4) The data in each partition is transferred to the corresponding reduce task.
(II) Reduce-side working process
(1) Each reduce task receives data from different map tasks, and the data from each map is sorted. If the amount of data received at the reduce side is small enough, it is kept directly in memory (in a buffer whose size is controlled by a mapred.-prefixed property); once the data exceeds a certain proportion of the buffer size (determined by another mapred.-prefixed property), it is merged and spilled to disk.
(2) The reduction program defined at the application layer is executed, and the final data is output: compressed if required, written, and finally stored in HDFS.
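The map-side buffer behaviour described above can be checked with a little arithmetic. This sketch is purely illustrative: the function name `spill_trigger_mb` is ours, and the figures are the defaults quoted in the text (100 MB buffer, 80% spill threshold via io.sort.spill.percent).

```python
# Illustrative only: compute the buffer occupancy at which a map-side
# spill file is created, per the defaults quoted in the text.

def spill_trigger_mb(buffer_mb: float, spill_percent: float) -> float:
    """Buffer occupancy (MB) at which a spill file is created."""
    return buffer_mb * spill_percent

# Defaults from the text: 100 MB circular buffer, 80% threshold.
assert abs(spill_trigger_mb(100, 0.80) - 80.0) < 1e-9

# Enlarging either knob defers the spill, shifting the write/transfer
# cost that the tuning method tries to minimise:
assert abs(spill_trigger_mb(200, 0.80) - 160.0) < 1e-9
```

These are exactly the kind of interacting knobs whose joint effect the patent proposes to learn rather than tune by hand.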
Referring to fig. 1, a flowchart of a method for tuning configuration parameters of a deep learning-based big data system according to a preferred embodiment of the present invention is shown. As shown in fig. 1, the method for tuning configuration parameters of a deep learning-based big data system provided in this embodiment mainly includes a neural network training step and a configuration parameter predicting step:
First, in steps S101 to S102, the neural network training step is performed: a deep neural network is constructed, using the historical operating states provided by the administrator as the training set and the predicted optimal configuration parameters as the output. The MapReduce time cost serves as the final metric of the network structure, which is repeatedly adjusted via feedback to obtain the final deep neural network structure. The specific steps are as follows:
Step S101: a deep neural network is preliminarily constructed, with at least one MapReduce parameter as an input parameter, the optimal configuration parameters to be predicted as output parameters, and historical data of the big data system as the training sample set. The historical data of the big data system is specifically the historical operating states provided by an administrator. Preferably, the at least one MapReduce parameter is selected from the important parameters in Table 1 below. In a concrete application, depending on the situation, the 20 parameters with the greatest influence on the system are obtained from the system administrator and added to the input and output lists; the selected parameters are listed in Table 1 below. The number of the at least one MapReduce parameter is preferably between 2 and 20.
Table 1. Important parameters
Step S102: with the MapReduce time as the metric of the deep neural network, the weights of the neurons in each layer are adjusted according to the parameter learning rule based on back propagation until the MapReduce time meets the time-cost requirement. In this step, the MapReduce time cost is the final metric of the network structure, which is repeatedly adjusted via feedback to obtain the final structure of the deep neural network.
Subsequently, in steps S103 to S104, the configuration parameter prediction step is performed: the trained deep neural network is used to predict the configuration parameters that optimize the working effect. The specific steps are as follows:
step S103, setting an initial value of the at least one mapping protocol parameter, and reading the current test data.
Step S104: the initial values of the at least one MapReduce parameter and the current test data are input into the deep neural network obtained in the neural network training step, yielding the configuration parameters of the deep-learning-based big data system.
Thus, after the map task and reduce task parameters are initialized, the deep neural network is introduced. The training set comes from historical task logs; the historical parameters are learned in a semi-supervised manner, and the parameters inside the deep neural network are obtained from the feedback of known historical operating states and working performance, so that the configuration parameters giving the best working performance can be predicted and tuned for different programs and different input data.
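The training-set construction from historical task logs described above can be sketched as follows. All names here (`JobLog`, `PARAM_NAMES`, the two example properties) are hypothetical stand-ins, since the patent's Table 1 of selected parameters is not reproduced in this text.

```python
# Hypothetical sketch: turn historical job logs into
# (parameter vector, observed MapReduce time) training pairs.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class JobLog:
    params: dict      # configuration parameters used for the run
    runtime_s: float  # observed MapReduce time (the feedback signal)

# Stand-ins for the 2-20 selected parameters of Table 1.
PARAM_NAMES = ["io.sort.mb", "io.sort.spill.percent"]

def to_training_pairs(logs: List[JobLog]) -> List[Tuple[list, float]]:
    """Encode each historical run as (parameter vector, runtime)."""
    return [([log.params[p] for p in PARAM_NAMES], log.runtime_s)
            for log in logs]

logs = [JobLog({"io.sort.mb": 100, "io.sort.spill.percent": 0.8}, 412.0),
        JobLog({"io.sort.mb": 200, "io.sort.spill.percent": 0.7}, 388.0)]
pairs = to_training_pairs(logs)
# pairs[0] == ([100, 0.8], 412.0)
```

The runtime in each pair plays the role of the feedback signal (the MapReduce time) against which the network's weights are later adjusted.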
Please refer to fig. 2, which is a flowchart illustrating a neural network training step in the method according to the preferred embodiment of the invention. As shown in fig. 2, the neural network training step includes:
first, in step S201, the flow starts;
subsequently, in step S202, a deep neural network is preliminarily constructed. It is an ordinary deep neural network trained by back propagation. Specifically, this step constructs a five-layer deep neural network with MapReduce parameters as input parameters and the optimal configuration parameters to be predicted as output parameters; the five layers comprise an input layer, an output layer, and three hidden layers.
Subsequently, in step S203, the historical data of the big data system is fed into the deep neural network as the training sample set. A training sample x is input, and the output of each hidden layer is x^l = f(u^l), where u^l = W^l x^{l-1} + b^l; the function f is the output activation function, W the weights, b the bias term, and l denotes the l-th layer. Because the parameters of the map and reduce processes cannot grow without bound and lie within fixed ranges, b is fixed to the upper limits of the parameters.
Subsequently, in step S204, it is judged whether the MapReduce time meets the time-cost requirement. A squared-error cost function measures the error: assuming the output has c parameter classes and the training sample set contains N samples, the error E_N between the MapReduce time and the specified time cost t is E_N = (1/2) Σ_{n=1}^{N} Σ_{k=1}^{c} (t_k^n − y_k^n)^2, where t_k^n is the k-th dimension of the target output of the n-th training sample, y_k^n is the k-th dimension of the actual output for the n-th sample, and c = 20. The errors between the layers of the network are computed; when the error is below a preset threshold the flow goes to step S206 to save the deep neural network, otherwise to step S205 to adjust the weights of each layer of neurons.
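The forward pass and the squared-error criterion of steps S203-S204 can be sketched in NumPy as follows. The hidden-layer widths and the sigmoid activation are assumptions for illustration; the patent fixes only the five-layer structure and c = 20 output dimensions.

```python
import numpy as np

# Sketch of the forward pass x_l = f(u_l), u_l = W_l @ x_{l-1} + b_l,
# and the squared-error cost E_N used as the stopping criterion.
rng = np.random.default_rng(0)
sizes = [20, 16, 16, 16, 20]   # input, three hidden layers, output (c = 20)
W = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

def f(u):
    # Assumed activation: sigmoid (the patent does not name f).
    return 1.0 / (1.0 + np.exp(-u))

def forward(x):
    for Wl, bl in zip(W, b):
        x = f(Wl @ x + bl)
    return x

def cost(Y, T):
    # E_N = 1/2 * sum_n sum_k (t_k^n - y_k^n)^2
    return 0.5 * np.sum((T - Y) ** 2)

x = rng.random(20)             # one sample of 20 MapReduce parameters
y = forward(x)
assert y.shape == (20,)        # c = 20 output dimensions
assert cost(y, y) == 0.0       # zero error when output equals target
```

With a sigmoid output, all predictions lie in (0, 1), so in practice each configuration parameter would be rescaled to its valid range (the text's fixed upper limits) before use.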
In step S205, the weights of the neurons in each layer are adjusted. Specifically, in this step the weight W of each layer is scaled via the sensitivity δ of its neurons, with the gradient ∂E_N/∂W^l = x^{l-1} (δ^l)^T, and W^l is adjusted along the direction that decreases E_N until the weights minimizing E_N are obtained;
where the sensitivity of the l-th layer is δ^l = (W^{l+1})^T δ^{l+1} ∘ f'(u^l), and the sensitivity of the output-layer neurons is δ^L = f'(u^L) ∘ (y^n − t^n), where L denotes the total number of layers, y^n is the actual output for the n-th sample, and t^n is the target output for the n-th sample.
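A minimal sketch of the step-S205 update follows, assuming a sigmoid activation and a small learning rate eta (η is an assumption — the text only says the weights are scaled via the sensitivity δ). It implements exactly the two sensitivity formulas above plus a gradient step on ∂E_N/∂W^l = x^{l-1}(δ^l)^T.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [20, 16, 16, 16, 20]   # five layers as in the text
W = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

def train_step(x, t, eta=0.1):
    # Forward pass, keeping each layer's output x_l.
    xs = [x]
    for Wl, bl in zip(W, b):
        xs.append(sigmoid(Wl @ xs[-1] + bl))
    y = xs[-1]
    # f'(u_l) for the sigmoid equals x_l * (1 - x_l).
    fp = [s * (1 - s) for s in xs[1:]]
    # Output-layer sensitivity: delta_L = f'(u_L) ∘ (y - t).
    delta = fp[-1] * (y - t)
    for l in reversed(range(len(W))):
        # Hidden-layer sensitivity uses the *pre-update* weights:
        # delta_l = (W_{l+1})^T delta_{l+1} ∘ f'(u_l).
        delta_prev = (W[l].T @ delta) * fp[l - 1] if l > 0 else None
        # Gradient step: dE/dW_l = delta_l x_{l-1}^T.
        W[l] -= eta * np.outer(delta, xs[l])
        b[l] -= eta * delta
        delta = delta_prev
    return 0.5 * np.sum((y - t) ** 2)   # error before this update

x, t = rng.random(20), rng.random(20)
e0 = train_step(x, t)
e1 = train_step(x, t)
assert e1 < e0   # one update reduces the error on this sample
```

Iterating this step until the error falls below the preset threshold corresponds to the S204/S205 loop of fig. 2.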
In step S206, the deep neural network is saved;
finally, in step S207, the flow of the neural network training step ends.
In the invention, both the MapReduce time and the configuration parameters serve as outputs. The MapReduce time can be understood as an intermediate output, while the configuration parameters are the essential output that is recorded and used. After the output time is compared against the ideal time, the weights are adjusted; adjusting the weights changes the time output and also changes the configuration parameter output, so the configuration parameters achieving the optimal time are obtained.
Referring to fig. 3, a block diagram of a system for tuning configuration parameters of a deep-learning-based big data system according to a preferred embodiment of the present invention is shown. As shown in fig. 3, the system 300 of this embodiment comprises a neural network training module 301 and a configuration parameter prediction module 302.
The neural network training module 301 is configured to preliminarily construct a deep neural network, with at least one MapReduce parameter as an input parameter, the optimal configuration parameters to be predicted as output parameters, and historical data of the big data system as the training sample set; with the MapReduce time as the metric of the deep neural network, it adjusts the weights of the neurons in each layer according to the parameter learning rule based on back propagation until the MapReduce time meets the time-cost requirement. Preferably, the at least one MapReduce parameter is selected from Table 1 above, and its number is preferably between 2 and 20.
Specifically, the neural network training module 301 constructs a five-layer deep neural network with MapReduce parameters as input parameters and the optimal configuration parameters to be predicted as output parameters; the five layers comprise an input layer, an output layer, and three hidden layers. A training sample x is input, and the hidden-layer output is x^l = f(u^l), where u^l = W^l x^{l-1} + b^l; the function f is the output activation function, W the weights, b the bias term, and l denotes the l-th layer.
The neural network training module 301 further measures the error with a squared-error cost function: assuming the output has c parameter classes and the training sample set contains N samples, the error E_N between the MapReduce time and the specified time cost t is E_N = (1/2) Σ_{n=1}^{N} Σ_{k=1}^{c} (t_k^n − y_k^n)^2, where t_k^n is the k-th dimension of the target output of the n-th training sample and y_k^n is the k-th dimension of the actual output for the n-th sample.
The errors between the layers of the network are then computed. When the error is below a preset threshold the deep neural network is saved; otherwise the weight W of each layer of neurons is scaled via the neuron sensitivity δ, with ∂E_N/∂W^l = x^{l-1} (δ^l)^T;
where the sensitivity of the l-th layer is δ^l = (W^{l+1})^T δ^{l+1} ∘ f'(u^l), and the sensitivity of the output-layer neurons is δ^L = f'(u^L) ∘ (y^n − t^n), where L denotes the total number of layers, y^n is the actual output for the n-th sample, and t^n is the target output for the n-th sample.
The configuration parameter prediction module 302 is connected to the neural network training module 301 and inputs the set initial values of the at least one MapReduce parameter and the current test data into the deep neural network obtained in the neural network training step, yielding the configuration parameters of the deep-learning-based big data system.
In summary, the invention tunes the configuration parameters of the MapReduce framework with a deep neural network, avoiding the difficulties of manual adjustment and of searching for optimal parameters; by learning from historical parameters, the characteristics of each configuration parameter and the relationships among them can be captured more deeply, and the deep network's repeated learning, weight updating, and prediction yield the parameter configuration best suited to the application layer's requirements. The invention not only saves parameter-tuning time but also ensures that, with suitable parameters, the system's working time is allotted to compressing and decompressing data, which greatly reduces writing and transmission time, so the whole system runs fast and achieves good working results. Moreover, the method learns automatically from the input data of different infrastructure layers and the requirements raised by the application layer, and is highly adaptable.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the invention.

Claims (4)

1. A method for tuning configuration parameters of a big data system based on deep learning, characterized by comprising a neural network training step and a configuration parameter prediction step; wherein,
the neural network training step comprises the following steps:
step 1-1, preliminarily constructing a deep neural network, with at least one MapReduce parameter as an input parameter, the optimal configuration parameters to be predicted as output parameters, and historical data of the big data system as the training sample set;
step 1-2, with the MapReduce time as the metric of the deep neural network, adjusting the weights of the neurons in each layer according to a parameter learning rule based on back propagation until the MapReduce time meets the time-cost requirement;
the configuration parameter prediction step comprises the following steps:
step 2-1, setting initial values of the at least one MapReduce parameter and reading the current test data;
step 2-2, inputting the initial values of the at least one MapReduce parameter and the current test data into the deep neural network obtained in the neural network training step to obtain the configuration parameters of the deep-learning-based big data system;
in the step 1-1:
constructing a five-layer deep neural network with mapping protocol parameters as input parameters, taking optimal configuration parameters to be predicted as output parameters, wherein the five-layer network respectively comprises an input layer, an output layer and three hidden layers, training samples x are input, and the output of the hidden layers is that y is xl=f(ul) Wherein u isl=Wlxl-1+blThe function f represents the output activation function, W represents the weight, b represents the bias term, and l represents the secondlA layer;
in the step 1-2:
using a square error cost function to measure errors, and assuming that the output parameter class is c and N training samples in the training sample set are shared, mapping the errors E between the reduction time and the specified time cost tNComprises the following steps:wherein,for the k-th dimension of the target output of the nth training sample,the k dimension of the actual output corresponding to the nth sample;
calculating the error between each layer of network, saving the deep neural network when the error is smaller than a preset threshold, otherwise, scaling the weight W of each layer of neuron through the sensitivity delta of the neuron:
wherein,and sensitivity of the l-th layer:the sensitivity of the neurons of the output layer is:wherein L represents the total number of layers, ynIs the actual output of the nth neuron, tnIs the target output of the nth neuron, signRepresenting a convolution.
2. The method for tuning configuration parameters of a big data system based on deep learning according to claim 1, wherein the number of the at least one MapReduce parameter is between 2 and 20.
3. A system for tuning configuration parameters of a big data system based on deep learning, characterized by comprising a neural network training module and a configuration parameter prediction module; wherein,
the neural network training module preliminarily constructs a deep neural network, with at least one MapReduce parameter as an input parameter, the optimal configuration parameters to be predicted as output parameters, and historical data of the big data system as the training sample set; with the MapReduce time as the metric of the deep neural network, it adjusts the weights of the neurons in each layer according to a parameter learning rule based on back propagation until the MapReduce time meets the time-cost requirement;
the configuration parameter prediction module inputs the set initial values of the at least one MapReduce parameter and the current test data into the deep neural network obtained in the neural network training step to obtain the configuration parameters of the deep-learning-based big data system;
the neural network training module is used for constructing a five-layer deep neural network with a mapping protocol parameter as an input parameter and an optimal configuration parameter to be predicted as an output parameter, the five-layer network comprises an input layer, an output layer and three hidden layers respectively, a training sample x is input, and the output of the hidden layer is xl=f(ul) Wherein u isl=Wlxl-1+blWherein, the function f represents the output activation function, W represents the weight, b represents the bias term, and l represents the l-th layer;
the neural network training module measures the error with a squared-error cost function: assuming the output parameters have c classes and the training sample set contains N training samples, the error E_N between the MapReduce time and the specified time cost t is E_N = (1/2) Σ_{n=1}^{N} Σ_{k=1}^{c} (t_k^n − y_k^n)², where t_k^n is the k-th dimension of the target output of the n-th training sample and y_k^n is the k-th dimension of the actual output for the n-th sample;
the error at each layer of the network is calculated; when the error is smaller than a preset threshold the deep neural network is saved, and otherwise the weights W of each layer of neurons are adjusted through the neuron sensitivities δ, with ∂E/∂W^l = δ^l (x^{l−1})^T,
where the sensitivity of the l-th layer is δ^l = (W^{l+1})^T δ^{l+1} ∘ f'(u^l) and the sensitivity of the output-layer neurons is δ^L = f'(u^L) ∘ (y^n − t^n), where L denotes the total number of layers, y^n is the actual output for the n-th sample, t^n is its target output, and the symbol ∘ denotes element-wise multiplication.
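The sensitivity formulas in claim 3 can be verified numerically: the gradient ∂E/∂W^l = δ^l (x^{l−1})^T implied by the recursion should match a central finite difference of the cost. The layer sizes, sigmoid activation, and random values below are assumptions made only for the check.

```python
import numpy as np

# Finite-difference check of the sensitivity-based gradient:
# delta^L = f'(u^L) o (y - t),  delta^l = (W^{l+1})^T delta^{l+1} o f'(u^l),
# dE/dW^l = delta^l (x^{l-1})^T.  Sizes and sigmoid are assumptions.

def f(u):
    return 1.0 / (1.0 + np.exp(-u))

def f_prime(u):
    s = f(u)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
sizes = [3, 5, 5, 5, 2]
W = [rng.normal(0, 0.5, (sizes[l + 1], sizes[l])) for l in range(4)]
b = [rng.normal(0, 0.1, sizes[l + 1]) for l in range(4)]
x0, t = rng.random(3), rng.random(2)

def cost():
    x = x0
    for Wl, bl in zip(W, b):
        x = f(Wl @ x + bl)
    return 0.5 * np.sum((t - x) ** 2)   # E = 1/2 sum_k (t_k - y_k)^2

# analytic gradient via the sensitivity recursion
xs, us = [x0], []
for Wl, bl in zip(W, b):
    us.append(Wl @ xs[-1] + bl)
    xs.append(f(us[-1]))
deltas = [None] * 4
deltas[3] = f_prime(us[3]) * (xs[4] - t)
for l in range(2, -1, -1):
    deltas[l] = (W[l + 1].T @ deltas[l + 1]) * f_prime(us[l])
grads = [np.outer(deltas[l], xs[l]) for l in range(4)]

# central finite difference on one entry of each weight matrix
eps = 1e-6
for l in range(4):
    keep = W[l][0, 0]
    W[l][0, 0] = keep + eps
    e_plus = cost()
    W[l][0, 0] = keep - eps
    e_minus = cost()
    W[l][0, 0] = keep
    numeric = (e_plus - e_minus) / (2 * eps)
    assert abs(numeric - grads[l][0, 0]) < 1e-5   # analytic and numeric agree
```

The agreement at every layer confirms that the recursion and the weight-gradient formula are mutually consistent.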
4. The deep learning-based big data system configuration parameter tuning system according to claim 3, wherein the number of the at least one MapReduce parameter is 2 to 20.
CN201710361578.3A 2017-05-22 2017-05-22 The method and system of big data system configuration parameter tuning based on deep learning Active CN107229693B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710361578.3A CN107229693B (en) 2017-05-22 2017-05-22 The method and system of big data system configuration parameter tuning based on deep learning


Publications (2)

Publication Number Publication Date
CN107229693A CN107229693A (en) 2017-10-03
CN107229693B true CN107229693B (en) 2018-05-01

Family

ID=59933231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710361578.3A Active CN107229693B (en) 2017-05-22 2017-05-22 The method and system of big data system configuration parameter tuning based on deep learning

Country Status (1)

Country Link
CN (1) CN107229693B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992404B * 2017-12-31 2022-06-10 *** Communications Group Hubei Co Ltd Cluster computing resource scheduling method, device, equipment and medium
CN108363478B * 2018-01-09 2019-07-12 Peking University For wearable device deep learning application model load sharing system and method
CN110427356B * 2018-04-26 2021-08-13 China Mobile (Suzhou) Software Technology Co Ltd Parameter configuration method and equipment
CN108764568B * 2018-05-28 2020-10-23 Harbin Institute of Technology Data prediction model tuning method and device based on LSTM network
CN108990141B * 2018-07-19 2021-08-03 Zhejiang University of Technology Energy-collecting wireless relay network throughput maximization method based on deep multi-network learning
CN109041195A * 2018-07-19 2018-12-18 Zhejiang University of Technology A kind of energy-collecting type wireless relay network throughput maximization approach based on semi-supervised learning
CN109445935B * 2018-10-10 2021-08-10 Hangzhou Dianzi University Self-adaptive configuration method of high-performance big data analysis system in cloud computing environment
CN109815537B * 2018-12-19 2020-10-27 Tsinghua University High-flux material simulation calculation optimization method based on time prediction
CN109739950B * 2018-12-25 2020-03-31 China University of Political Science and Law Method and device for screening applicable legal provision
CN110134697B * 2019-05-22 2024-01-16 Nanjing University Method, device and system for automatically adjusting parameters of storage engine for key value
TWI752614B * 2020-09-03 2022-01-11 National Yang Ming Chiao Tung University Multiple telecommunication endpoints system and testing method thereof based on AI decision
CN113254472B * 2021-06-17 2021-11-16 Zhejiang Dahua Technology Co Ltd Parameter configuration method, device, equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104504460A * 2014-12-09 2015-04-08 Beijing Didi Infinity Technology and Development Co Ltd Method and device for predicating user loss of car calling platform
CN106022521A * 2016-05-19 2016-10-12 Sichuan University Hadoop framework-based short-term load prediction method for distributed BP neural network
CN106202431A * 2016-07-13 2016-12-07 Huazhong University of Science and Technology A kind of Hadoop parameter automated tuning method and system based on machine learning
CN106648654A * 2016-12-20 2017-05-10 Shenzhen Institutes of Advanced Technology Data sensing-based Spark configuration parameter automatic optimization method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9886310B2 (en) * 2014-02-10 2018-02-06 International Business Machines Corporation Dynamic resource allocation in MapReduce


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Optimization and Research of BP Neural Network; Lü Qiongshuai; China Master's Theses Full-text Database, Information Science and Technology Section; 2012-04-15 (No. 4); pp. I140-69 *

Also Published As

Publication number Publication date
CN107229693A (en) 2017-10-03

Similar Documents

Publication Publication Date Title
CN107229693B (en) The method and system of big data system configuration parameter tuning based on deep learning
US20190279088A1 (en) Training method, apparatus, chip, and system for neural network model
CN111966684B (en) Apparatus, method and computer program product for distributed data set indexing
US10649996B2 (en) Dynamic computation node grouping with cost based optimization for massively parallel processing
CN110852421B (en) Model generation method and device
CN112884086B (en) Model training method, device, equipment, storage medium and program product
US20230385333A1 (en) Method and system for building training database using automatic anomaly detection and automatic labeling technology
CN103942108B (en) Resource parameters optimization method under Hadoop isomorphism cluster
US20210398013A1 (en) Method and system for performance tuning and performance tuning device
CN104750780A (en) Hadoop configuration parameter optimization method based on statistic analysis
CN107562804B (en) Data caching service system and method and terminal
US11568170B2 (en) Systems and methods of generating datasets from heterogeneous sources for machine learning
US20150347470A1 (en) Run-time decision of bulk insert for massive data loading
US9953057B2 (en) Partitioned join with dense inner table representation
CN116050540A (en) Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling
CN110688993B (en) Spark operation-based computing resource determination method and device
Sudharsan et al. Globe2train: A framework for distributed ml model training using iot devices across the globe
CN114443236A (en) Task processing method, device, system, equipment and medium
US11934927B2 (en) Handling system-characteristics drift in machine learning applications
CN106502842A (en) Data reconstruction method and system
CN115412401B (en) Method and device for training virtual network embedding model and virtual network embedding
US9600517B2 (en) Convert command into a BULK load operation
WO2022161081A1 (en) Training method, apparatus and system for integrated learning model, and related device
CN114968585A (en) Resource configuration method, device, medium and computing equipment
CN114817315B (en) Data processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant