CN107423823B - R language-based machine learning modeling platform architecture design method

Info

Publication number: CN107423823B
Application number: CN201710684578.7A
Authority: CN
Inventors: 竹登虎; 勇萌哲
Original assignee: Chengdu Youe Data Co ltd
Current assignee: Chengdu Youe Data Co ltd
Priority date: 2017-08-11
Filing date: 2017-08-11
Publication date: 2020-11-10
Anticipated expiration: 2037-08-11
Also published as: CN107423823A

Abstract

The invention discloses a method for designing the architecture of a machine learning modeling platform based on the R language. A visual machine learning operator is built from R operators; an Oozie component distributes the R operators within it to different Hadoop cluster computing nodes; and the computing nodes read data managed by an HDFS component and compute according to the logical relationships of the machine learning operator to obtain its final result. The method achieves distributed computation for R language-based visual machine learning operators, so the modeling platform gains R's rich set of machine learning operators and its efficient, flexible programming system. Because the Oozie flow control component adaptively schedules the R operators to different Hadoop cluster computing nodes, cluster load is balanced and concurrent, high-volume modeling by multiple users is supported.

Description

R language-based machine learning modeling platform architecture design method

Technical Field

The invention belongs to the field of big data analysis and processing, and in particular relates to a machine learning modeling platform architecture design method based on the R language, which is used to perform distributed computation on machine learning operators.

Background

A big data analysis and processing platform builds on a distributed computing architecture and machine learning operators to solve data mining and modeling problems at very large data scale. In practice, however, most input data and modeling tasks on such platforms turn out to be small. For small inputs, a distributed processing architecture offers no clear efficiency advantage and instead suffers from noticeable data interaction latency. Moreover, because few machine learning operators currently support distributed computing, the modeling capability of a platform under a fully distributed processing architecture is weaker than on a single machine.

The R language is a common single-machine modeling tool in data mining, with a rich set of machine learning operators and an efficient, flexible programming system. If these advantages can be combined with the distributed architecture of the current platform, the platform gains far more operators, executes small-data modeling more efficiently, and overcomes the limited multi-user access capacity of a single-machine setup.

However, an R machine learning operator is a single-machine program that runs on one computer only. If different R operators are assigned to different computers, data cannot be passed between upstream and downstream operators. In addition, R operators currently cannot read distributed data, and their outputs are not automatically stored in the distributed cluster. If many users run machine learning tasks on the same server, that server may become overloaded, which degrades the user experience.

Disclosure of Invention

The invention aims to provide an R language-based machine learning modeling platform architecture design method that solves the technical problem that R operators cannot be computed in a distributed manner.

The technical scheme adopted by the invention is as follows:

A machine learning modeling platform architecture design method based on the R language, characterized in that a visual machine learning operator is built on the R language; the R operators in the machine learning operator are distributed to different Hadoop cluster computing nodes by an Oozie component; and the Hadoop cluster computing nodes read data managed by an HDFS component and compute according to the logical relationships of the machine learning operator to obtain its final result.

Further, the method is realized by the following specific steps:

S201: Build a visual machine learning operator based on the R language on a modeling platform, wherein the machine learning operator comprises n R operators and data flows from the 1st operator to the nth operator;

S202: Dynamically distribute the n R operators to different Hadoop cluster computing nodes using an Oozie component;

S203: The computing node hosting the 1st operator downloads the source data from the modeling data source managed by the HDFS component, calls its local R runtime environment to execute the data processing function of the 1st operator, and, once computation completes, uploads the result to a temporary path TmpPath managed by the HDFS component;

S204: The 2nd through (n-1)th operators are computed in sequence: the computing node hosting each R operator downloads data from TmpPath, calls its local R runtime environment to execute that operator's data processing function, and uploads the result to TmpPath once computation completes; each upload overwrites the data previously stored at TmpPath;

S205: The computing node hosting the nth operator downloads data from TmpPath, calls its local R runtime environment to execute the data processing function of the nth operator, and uploads the result to ModelPath managed by the HDFS component once computation completes; the data stored at ModelPath is the final result of the machine learning operator.
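The path rule in steps S203 to S205 can be summarized in a few lines of Python. This is only an illustrative sketch: the function name `io_plan` and the example paths are hypothetical, and the single-operator case (n = 1, reading the source and writing straight to ModelPath) is an assumption, since the text does not describe it.

```python
from typing import List, Tuple


def io_plan(n: int, source: str, tmp_path: str, model_path: str) -> List[Tuple[int, str, str]]:
    """Return (operator index, input path, output path) for operators 1..n.

    Operator 1 reads the modeling data source, operator n writes ModelPath,
    and every other read or write goes through the shared TmpPath.
    """
    if n < 1:
        raise ValueError("need at least one operator")
    plan = []
    for i in range(1, n + 1):
        src = source if i == 1 else tmp_path      # S203 vs. S204/S205 input
        dst = model_path if i == n else tmp_path  # S205 vs. S203/S204 output
        plan.append((i, src, dst))
    return plan


if __name__ == "__main__":
    # Hypothetical HDFS paths, used only for illustration.
    for row in io_plan(4, "/data/source", "/tmp/TmpPath", "/models/ModelPath"):
        print(row)
```

Because every intermediate operator both reads and writes TmpPath, the operators must run strictly one after another; the overwrite in S204 is safe only under that sequential ordering.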

Further, in step S201, the step of building a machine learning operator is as follows:

S301: A user adds R operators to the WEB application modeling platform using the operators' native packaging format and classifies them by function;

S302: The WEB application modeling platform creates a classification directory from the classification result and manages and displays it visually;

S303: The user drags n R operators from the classification directory into the workflow editing area and connects them according to the intended logical relationships, completing the construction of the machine learning operator.
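One way to turn the connections drawn in the workflow editing area into an execution order is a topological sort. The sketch below assumes the platform stores each connection as an (upstream, downstream) pair; that representation, the function name `execution_order`, and the operator names are hypothetical and not taken from the patent.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+


def execution_order(edges):
    """Order operators so that every operator runs after its upstream operators.

    edges: iterable of (upstream, downstream) operator names, one per
    connection drawn in the workflow editing area.
    """
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # The downstream operator depends on the upstream operator.
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


if __name__ == "__main__":
    # Hypothetical three-operator chain.
    print(execution_order([("read_csv", "normalize"), ("normalize", "kmeans")]))
    try:
        execution_order([("a", "b"), ("b", "a")])
    except CycleError:
        print("connections form a cycle; the workflow is invalid")
```

For the linear data flow described in the method (1st operator through nth operator), the result is simply the chain in order; the cycle check guards against invalid wiring.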

Further, in step S202, the step of allocating the R operator by the Oozie component is:

S401: Write a Shell script file for each R operator to receive the configuration parameters set for that R operator in the modeling platform;

S402: Upload all Shell script files to the storage address configured for the Oozie component in the HDFS component;

S403: According to the logical relationships of the machine learning operator, the scheduling unit of the Oozie component generates an Oozie scheduling configuration file that calls the Shell script file corresponding to each R operator;

S404: Start the Oozie component, which schedules and distributes the Shell script files corresponding to the R operators across the Hadoop cluster computing nodes according to the Oozie scheduling configuration file.
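The patent does not reproduce the scheduling configuration file from S403. As a rough sketch, an Oozie workflow definition that chains one Shell action per R operator could be generated as below. The schema versions (`uri:oozie:workflow:0.5`, `uri:oozie:shell-action:0.2`), the `${jobTracker}` and `${nameNode}` properties, the script directory, and all action and script names are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

WF_NS = "uri:oozie:workflow:0.5"         # assumed workflow schema version
SHELL_NS = "uri:oozie:shell-action:0.2"  # assumed shell-action schema version


def build_workflow(name, scripts, script_dir):
    """Render a linear Oozie workflow with one Shell action per R operator.

    scripts: ordered list of (action_name, script_file, [arguments]).
    script_dir: HDFS directory holding the uploaded Shell scripts (S402).
    """
    if not scripts:
        raise ValueError("need at least one operator script")
    root = ET.Element("workflow-app", {"xmlns": WF_NS, "name": name})
    ET.SubElement(root, "start", {"to": scripts[0][0]})
    for i, (action, script, args) in enumerate(scripts):
        # On success, go to the next operator, or to the end node after the last.
        next_node = scripts[i + 1][0] if i + 1 < len(scripts) else "end"
        act = ET.SubElement(root, "action", {"name": action})
        shell = ET.SubElement(act, "shell", {"xmlns": SHELL_NS})
        ET.SubElement(shell, "job-tracker").text = "${jobTracker}"
        ET.SubElement(shell, "name-node").text = "${nameNode}"
        ET.SubElement(shell, "exec").text = script
        for arg in args:
            ET.SubElement(shell, "argument").text = arg
        # Ship the script from HDFS into the action's working directory.
        ET.SubElement(shell, "file").text = f"{script_dir}/{script}#{script}"
        ET.SubElement(act, "ok", {"to": next_node})
        ET.SubElement(act, "error", {"to": "fail"})
    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "R operator failed"
    ET.SubElement(root, "end", {"name": "end"})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_workflow(
        "r-modeling",
        [("opA", "opA.sh", ["k=3"]), ("opB", "opB.sh", [])],
        "/apps/r_ops",  # hypothetical HDFS directory
    ))
```

The `ok` transitions encode the logical relationships between operators, so each Shell action starts only after its predecessor succeeds; which computing node runs each action is left to the cluster scheduler.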

In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:

1. Cluster cooperative computing is realized for R language-based visual machine learning operators.

2. The modeling platform gains R's rich set of machine learning operators and its efficient, flexible programming system. The Oozie flow control component adaptively schedules the R operators to different Hadoop cluster computing nodes, which balances cluster load and supports concurrent, high-volume modeling by multiple users.

3. R's rich open-source machine learning operators are integrated, strengthening the platform's support for machine learning algorithms.

4. HDFS provides data sharing between operators: after one operator finishes, the operator on the next node can fetch its output and continue computing, which keeps the flow between single-machine learning operators complete and continuous in a cluster environment.

5. Common R operators are packaged visually, so users can combine operators graphically according to their mining requirements without hand-writing R scripts. Rich model evaluation and model storage functions are provided, so R modeling functions can be developed and models shared for reuse efficiently.

Drawings

The invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 is an overall architecture diagram of the present invention;

FIG. 2 is a flow chart of the present invention.

Detailed Description

All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except combinations of features and/or steps that are mutually exclusive.

The present invention will be described in detail with reference to fig. 1 and 2.

Specific example 1

A machine learning modeling platform architecture design method based on R language comprises the following steps:

Step 1: A user adds R operators to the modeling platform, which is a WEB modeling platform, using the operators' native packaging format and classifies them by function. The WEB application modeling platform creates a classification directory from the classification result and manages and displays it visually. The user drags n R operators from the classification directory into the workflow editing area and connects them according to the intended logical relationships, completing the construction of the machine learning operator. Data flows from the 1st operator to the nth operator.

Step 2: A Shell script file is written for each R operator. The script receives the configuration parameters set for that R operator in the modeling platform, calls the corresponding R operator, and passes the received parameters to it. The Shell script files for all R operators are copied to the same public path on every Hadoop cluster computing node, and an R runtime environment is installed on each computing node. All Shell script files are uploaded to the storage address configured for the Oozie component in the HDFS component. According to the logical relationships of the machine learning operator in the current platform, the scheduling unit generates an Oozie scheduling configuration file that calls the Shell script file corresponding to each R operator. Each scheduling unit consists of Shell actions, and the Oozie component passes operator parameters and activates operator functions by calling the corresponding Shell script files. The Oozie component is then started and schedules and distributes the Shell script files across the Hadoop cluster computing nodes according to the Oozie scheduling configuration file.

Step 3: The computing node hosting the 1st operator downloads the source data from the modeling data source managed by the HDFS component, calls its local R runtime environment to execute the data processing function of the 1st operator, and uploads the result to a temporary path TmpPath managed by the HDFS component once computation completes. The 2nd through (n-1)th operators are then computed in sequence: the computing node hosting each operator downloads data from TmpPath, calls its local R runtime environment to execute that operator's data processing function, and uploads the result to TmpPath, overwriting the data previously stored there. Finally, the computing node hosting the nth operator downloads data from TmpPath, calls its local R runtime environment to execute the data processing function of the nth operator, and uploads the result to ModelPath managed by the HDFS component.

Step 4: The Oozie component periodically initiates heartbeat monitoring. When it detects that the process running the nth operator has finished, it returns a signal that the machine learning process is complete, and the modeling platform displays the process as finished. The data stored at ModelPath is the final result of the machine learning operator. The modeling platform provides a progress bar that shows the computation progress of the machine learning operator; once the progress bar shows completion, the model file and other outputs generated by modeling can be downloaded via a link provided by the modeling platform.
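The monitoring in Step 4 can be pictured as a polling loop that checks the workflow status until it reaches a terminal state. The sketch below is an assumption about how a platform might consume such status information, not the patent's mechanism: `get_status` is a caller-supplied function (for example, one that queries Oozie for the job status), and the terminal states listed are Oozie's workflow job states SUCCEEDED, KILLED and FAILED.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "KILLED", "FAILED"}


def wait_for_workflow(get_status, interval_s=5.0, max_polls=1000, sleep=time.sleep):
    """Poll get_status() until the workflow reaches a terminal state.

    get_status: zero-argument callable returning the current status string
    (hypothetical hook; how the status is fetched is up to the caller).
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(interval_s)  # wait one heartbeat interval before polling again
    raise TimeoutError("workflow did not finish within the polling budget")


def progress_percent(finished_operators, total_operators):
    """Progress-bar value: share of operators that have finished, in percent."""
    if total_operators < 1:
        raise ValueError("total_operators must be positive")
    return 100 * finished_operators // total_operators


if __name__ == "__main__":
    # Simulated status sequence standing in for real heartbeat responses.
    statuses = iter(["PREP", "RUNNING", "RUNNING", "SUCCEEDED"])
    print(wait_for_workflow(lambda: next(statuses), interval_s=0.01))
    print(progress_percent(2, 4))
```

Passing `sleep` as a parameter keeps the loop testable without real delays; counting finished operators is one simple way to drive the progress bar, though the patent does not say how progress is computed.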

Specific example 2

Step 1: A user adds R operators to the WEB application modeling platform using the operators' native packaging format and classifies them by function. The WEB application modeling platform creates a classification directory from the classification result and manages and displays it visually. The user drags 3 R operators from the classification directory (operator A, operator B and operator C) into the workflow editing area and connects them according to the intended logical relationships, completing the construction of the machine learning operator. Data flows from operator A to operator B to operator C.

Step 2: The Oozie component dynamically distributes the 3 operators to different Hadoop cluster computing nodes: operator A to computing node A, operator B to computing node B, and operator C to computing node C.

Step 3: Computing node A downloads the source data from the modeling data source managed by the HDFS component, calls its local R runtime environment to execute the data processing function of operator A, and uploads the result to the temporary path TmpPath managed by the HDFS component once computation completes. Computing node B downloads data from TmpPath, calls its local R runtime environment to execute the data processing function of operator B, and uploads the result to TmpPath, overwriting the data previously stored there. Computing node C downloads data from TmpPath, calls its local R runtime environment to execute the data processing function of operator C, and uploads the result to ModelPath managed by the HDFS component once computation completes.
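The hand-off between nodes A, B and C can be traced with a small simulation. Everything here is a stand-in: `FakeHdfs` is an in-memory dictionary rather than a real HDFS client, and the three operator functions (drop missing values, scale, average) are hypothetical, since the example does not say what operators A, B and C compute.

```python
class FakeHdfs:
    """In-memory stand-in for HDFS: one value per path, uploads overwrite."""

    def __init__(self):
        self.files = {}

    def download(self, path):
        return self.files[path]

    def upload(self, path, data):
        self.files[path] = data  # overwrites any earlier content at this path


def run_chain(hdfs, operators, source, tmp_path, model_path):
    """Run (node, function) pairs in order, handing data off through tmp_path."""
    log = []
    last = len(operators) - 1
    for i, (node, fn) in enumerate(operators):
        src = source if i == 0 else tmp_path
        dst = model_path if i == last else tmp_path
        hdfs.upload(dst, fn(hdfs.download(src)))
        log.append((node, src, dst))
    return log


if __name__ == "__main__":
    hdfs = FakeHdfs()
    hdfs.upload("source", [1, None, 3])
    chain = [
        ("node A", lambda xs: [x for x in xs if x is not None]),  # operator A
        ("node B", lambda xs: [x * 10 for x in xs]),              # operator B
        ("node C", lambda xs: sum(xs) / len(xs)),                 # operator C
    ]
    for entry in run_chain(hdfs, chain, "source", "TmpPath", "ModelPath"):
        print(entry)
    print(hdfs.files["ModelPath"])
```

After the run, TmpPath holds only operator B's output, because B's upload replaced A's, while ModelPath holds operator C's result.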

The working principle of the invention is as follows: a machine learning operator is built on the platform; an R runtime environment is installed on each Hadoop cluster computing node; a Shell script file is written for each R operator, copied to every computing node, and uploaded to a storage address managed by the HDFS (Hadoop Distributed File System) component; the Oozie component schedules and distributes the Shell script files corresponding to the R operators across the cluster; the computing nodes download data from HDFS and compute; and once the R operator at the end of the data flow finishes, its result is taken as the final result of the machine learning operator.
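The per-operator Shell script is described but not listed in the patent. A minimal sketch of what such a wrapper might contain is rendered below from Python. The R script location, the `key=value` argument convention, and the paths are all hypothetical; the commands used are the standard `hdfs dfs -get`, `Rscript`, and `hdfs dfs -put -f` (where `-f` overwrites an existing destination).

```python
import shlex


def render_wrapper(r_script, src, dst, params):
    """Return the text of a Bash wrapper script for one R operator.

    r_script: path of the operator's R script on the computing node (assumed).
    src, dst: HDFS input and output paths for this operator.
    params:   configuration parameters received from the modeling platform.
    """
    args = [shlex.quote(f"{key}={value}") for key, value in params.items()]
    rscript_cmd = ["Rscript", shlex.quote(r_script), '"$WORK/input"', '"$WORK/output"'] + args
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "WORK=$(mktemp -d)",
        # Download the operator's input from HDFS to local disk.
        f'hdfs dfs -get {shlex.quote(src)} "$WORK/input"',
        # Run the R operator in the node's local R runtime environment.
        " ".join(rscript_cmd),
        # Upload the result, overwriting whatever is stored at the destination.
        f'hdfs dfs -put -f "$WORK/output" {shlex.quote(dst)}',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_wrapper("/opt/r_ops/kmeans.R", "/tmp/TmpPath", "/models/ModelPath", {"k": 3}))
```

Keeping the download, the R call, and the upload in one script means the scheduler only has to launch a single command per operator, which matches the description of Oozie calling one Shell script file per R operator.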

Claims

1. A machine learning modeling platform architecture design method based on R language is characterized in that: building a visual machine learning operator based on an R language, distributing the R operator in the machine learning operator to different Hadoop cluster computing nodes by using an Oozie component, calling data managed by an HDFS component by the Hadoop cluster computing nodes, and computing according to the logical relation of the machine learning operator to obtain a final result of the machine learning operator;

the method comprises the following specific steps:

2. The R language-based machine learning modeling platform architecture design method according to claim 1, characterized in that: in step S201, the step of building a machine learning operator is as follows:

3. The R language-based machine learning modeling platform architecture design method according to claim 1, characterized in that: in step S202, the step of allocating the R operator by the Oozie component is: